Abstract

Although it is known that Bayesian estimators may fail to converge or may converge
towards the wrong answer (i.e. be inconsistent) if the probability space is not finite or
if the model is misspecified (i.e. the data-generating distribution does not belong to the
family parametrized by the model), it is also a popular belief that a "good" or "close"
enough model should have good convergence properties. This paper incorporates Bayesian
priors into the Optimal Uncertainty Quantification (OUQ) framework [86] and in doing so
reveals extreme brittleness in Bayesian inference. These brittleness results demonstrate
that, contrary to popular belief, there is no such thing as a "close enough" model in
Bayesian inference in the following sense: we derive optimal lower and upper bounds on
posterior values obtained from models that exactly capture an arbitrarily large (but
finite) number of finite-dimensional marginals of the data-generating distribution and/or
that are arbitrarily close to the data-generating distribution in the Prokhorov or total
variation metrics; these bounds show that such models may still make the largest possible
prediction error after conditioning on an arbitrarily large number of sample data.
Therefore, under model misspecification, and without stronger assumptions than (arbitrary)
closeness in Prokhorov or total variation metrics, Bayesian inference offers no better
guarantee of accuracy than arbitrarily picking a value between the essential infimum and
supremum of the quantity of interest. In particular, an unscrupulous practitioner could
slightly perturb a given prior and model to achieve any desired posterior conclusions.

Finally, this paper also addresses the non-trivial technical questions of how to
incorporate priors in the OUQ framework. In particular, we develop the necessary
measure theoretical foundations in the context of Polish spaces, so that simultaneously
prior measures can be put on subsets of a product space of functions and measures and
important quantities of interest are measurable. We also develop the reduction theory for
optimization problems over measures on product spaces of measures and functions, thus
laying down the foundations for the scientific computation of optimal statistical
estimators.

1 Introduction

Throughout science and industry, Bayesian methods are increasingly popular tools for 
the understanding of uncertainty in often complicated contexts, and they impact the 
making of sometimes critical decisions. It is probably fair to say that, despite their 
popularity and documented successes, Bayesian methods have always attracted some 
degree of controversy and opposition: see e.g. [59] and rejoinders for a recent academic 
discussion, and [78, 81] for less formal treatments. Often, this opposition is philosophical 
in nature, particularly with regard to the subjective interpretation of the probabilities 
involved, which is something that remains counter-intuitive to many commentators: see 
[50, par. 35 & 37] for a recent example in law. However, there are also analytical reasons 
to be wary about the application of Bayesian methods: there is now half a century's
worth of examples of situations in which the Bayesian posterior behaves in apparently
perverse ways and yields predictions that are, by any objective measure, wrong. 

It is, in fact, now well understood that Bayesian methods may fail to converge or may 
converge towards the wrong solution if the underlying probability mechanism allows an 
infinite number of possible outcomes [42] and that, in these non-finite-probability-space
situations, this lack of convergence (commonly referred to as Bayesian inconsistency) is 
the rule rather than the exception [ I'S]. Conversely, it is known from the Bernstein-von 
Mises Theorem [24, 76, 110] that consistency (convergence of the Bayesian posterior to 
the data-generating distribution in the limit of observing infinite amounts of sample data) 
does indeed hold, under some regularity conditions, if the data-generating distribution 
belongs to the finite-dimensional family of distributions parametrized by the model (i.e. if 
the model is well specified). 

However, although it is known that this convergence may fail under model misspecification
[119, 61, 88, 1, 2, 74, 77, 62] (i.e. when the data-generating distribution does
not belong to the finite-dimensional family of distributions parametrized by the model, 
as illustrated in Figure 2.1) it is also a popular belief that a "close enough" (or "good 
enough") model should have good convergence properties: see e.g. [46, 95, ii]. This be- 
lief echoes G. E. P. Box's statement ["SO, p. 424] that "essentially, all models are wrong, 
but some are useful" and question [ , p. 74] "Remember that all models are wrong; the 
practical question is how wrong do they have to be to not be useful?" 

The brittleness results of this paper (Theorems 5.12, 6.4 and 6.10) show that, contrary 
to this popular belief, there is no such thing as a "close enough" model in Bayesian infer- 
ence in the following sense: suppose that one calculates optimal bounds (i.e. least upper 
and greatest lower bounds) on posterior values with respect to a Bayesian model that 
exactly captures an arbitrarily large (but finite) number of finite-dimensional marginals 
of the data-generating distribution and/or that is arbitrarily close to the true
(data-generating) distribution in the Prokhorov or total variation metrics; our results show
that such models may still make the largest possible prediction error even after condi- 
tioning on an arbitrarily large number of sample data observations, and also in the limit 
as the number of observations tends to infinity. Therefore, these brittleness theorems 
suggest that, under model misspecification and without stronger assumptions than close- 
ness in the Prokhorov and/or total variation metrics, Bayesian inference offers no better 
guarantee of accuracy than arbitrarily picking a value between the essential infimum and 
supremum of the quantity of interest. In particular, an unscrupulous practitioner can 
slightly perturb a given prior and model to achieve any desired posterior conclusions. 

As noted by G. E. P. Box, for complex systems, all models are misspecified. Indeed, 
for such systems, the data-generating distribution is a point in an infinite dimensional 
space of measures whereas Bayesian models, in their common (parametric) applications, 
form finite-dimensional subspaces of these infinite dimensional spaces of measures. This 
view, combined with our brittleness results, establishes that, in complex systems, al- 
though Bayesian methods may work well, they may also work very poorly. In either 
case, without more information, one will not know. 



1.1 Structure of the paper and main results 

This paper is structured as follows: Section 2 reviews questions of Bayesian consis- 
tency, inconsistency, model misspecification, and robustness through a motivating anal- 
ysis. Section 3 incorporates Bayesian priors into the Optimal Uncertainty Quantification 
(OUQ) framework [86]. In the OUQ framework, Uncertainty Quantification (UQ) is for- 
mulated as an optimization problem (over an infinite-dimensional set of functions and 
measures) corresponding to extremizing (i.e. finding worst and best case scenarios) prob- 
abilities of failure or other quantities of interest, subject to the constraints imposed by 
the scenarios compatible with the assumptions and information. In particular, the OUQ 
framework allows for the treatment of systems of partially-known probability measures 
and response functions; such systems arise naturally in studies of materials, financial sys- 
tems, insurance against catastrophes, medicine and law. This generalization of the OUQ 
framework to Bayesian priors requires the development of measure theoretical founda- 
tions so that simultaneously prior measures can be put on subsets of a product space 
of functions and measures and important quantities of interest are measurable. This 
non-trivial and highly technical task is achieved in Section 7 in the context of Polish 
topological spaces. 

In this generalization, priors are probability measures on spaces of measures and 
functions, and computing optimal bounds on prior values (given a set of priors) re- 
quires solving problems in which the optimization variables are measures on spaces of 
measures and functions. Section 4 shows how such optimization problems can, under 
general conditions, be reduced to the iteration of two optimization problems in which 
the optimization variables are measures and functions, where then we can apply the 
reduction theorems of [ ]. 

Our motivation for addressing the measurability and reduction issues raised in
Sections 7 and 4 goes beyond the investigation of the brittleness of Bayesian inference:
we also seek to lay down the foundations for the scientific computation of optimal
statistical estimators (i.e. using computers to find estimators with minimal statistical
errors), which is the subject of a sequel work.

Section 5 provides similar reduction theorems for the computation of optimal bounds 
on posterior values given a set of priors and the observation of the data. These reduction 
theorems lead to the Brittleness results (Theorems 5.12, 6.4 and 6.10). In particular,
Section 6 presents the Brittleness under Local Misspecification theorems (Theorems 6.4
and 6.10). That is, given a Bayesian model, Theorem 6.4 provides optimal bounds on
posterior values for priors that are at arbitrarily small distance (in the Prokhorov or 
total variation metrics) from a given Bayesian model. Theorems 6.4 and 6.10 show that 
these optimal bounds on posterior values are the essential supremum and infimum of 
the quantity of interest irrespective of the size of data and of the size of the metric 
neighbourhood around the Bayesian model. Sections 8 and 9 contain the proofs of our 
results. 



1.2 Notation and Conventions 

Throughout, for a topological space $\mathcal{Y}$, $\mathcal{B}(\mathcal{Y})$ will denote the Borel $\sigma$-algebra of subsets
of $\mathcal{Y}$ and $\mathcal{M}(\mathcal{Y})$ will denote the space of Borel probability measures. For an alternative
$\sigma$-algebra $\Sigma_{\mathcal{Y}}$ of subsets of $\mathcal{Y}$, the set of probability measures on the $\sigma$-algebra $\Sigma_{\mathcal{Y}}$ will
be denoted $\mathcal{M}(\Sigma_{\mathcal{Y}})$. If $\mathcal{Y}$ is metrizable, $\mathcal{M}(\mathcal{Y})$ is endowed with the weak-* topology and
the corresponding Borel $\sigma$-algebra unless specified otherwise. For a mapping between
topological spaces, the term measurable will mean Borel measurable unless specified
otherwise. Moreover, suprema over the empty set will have the value $-\infty$ and infima
over the empty set the value $+\infty$.

2 Bayesian Inconsistency and Model Misspecification: a 
motivating analysis 

To motivate the results of this paper, this section will analyse and review questions of 
Bayesian consistency, inconsistency, model misspecification, and robustness. There is, 
of course, a large literature on these topics, and we will not attempt to be exhaustive in 
providing references; rather, our aims are: first, to give a short reminder on how Bayesian 
inference is currently employed in Uncertainty Quantification (UQ); second, to identify 
issues and popular beliefs about what one actually learns from Bayesian inference, and 
thereby motivate the results of this paper; and, last, to present sufficient references that 
the interested reader can find technical justification for the formal manipulations of this 
section. 

In this section, we are interested in estimating

$\Phi(\mu^\dagger)$ (2.1)

where $\Phi$ is a known quantity of interest function and $\mu^\dagger$ is an unknown (or partially
known) probability measure on $\mathcal{X}$. For the purposes of exposition, in this section, we
assume that $\mathcal{X} = \mathbb{R}^k$. One example of a quantity of interest, when $\mathcal{X} = \mathbb{R}$, is
$\Phi(\mu^\dagger) := \mu^\dagger[X > a]$ (the probability that the random variable $X$ distributed according to $\mu^\dagger$
exceeds the threshold value $a$). We also assume that we are given $n$ independent samples
$d_1, \dots, d_n$, each distributed according to $\mu^\dagger$.

We will now present the parametric Bayesian answer to this problem. For the purposes
of exposition, in this section, we restrict our attention to parametric Bayesian
inference. We first introduce $\{\mu(\cdot, \theta)\}_{\theta \in \Theta}$, a family of probability distributions on $\mathcal{X}$
parameterized by $\theta \in \Theta$ (and commonly referred to as the model class). For the sake of
simplicity here we also assume that $\Theta = \mathbb{R}^\ell$. Let

$\mathcal{A}_0 := \{\mu(\cdot, \theta) \mid \theta \in \Theta\}$. (2.2)

Note that $\mathcal{A}_0$ is a subset of $\mathcal{M}(\mathcal{X})$ that may or may not contain $\mu^\dagger$. If $\mu^\dagger \notin \mathcal{A}_0$, then
the model is said to be misspecified; otherwise, the model is said to be well specified.

We next introduce $p_0 \in \mathcal{M}(\Theta)$, a probability distribution on $\Theta$ (the prior distribution
on $\theta$). Let $\pi_0$ be the push-forward of $p_0$ under the map $\theta \mapsto \mu(\cdot, \theta)$ and observe that $\pi_0$
is a probability distribution on $\mathcal{A}_0$, i.e. $\pi_0 \in \mathcal{M}(\mathcal{A}_0)$, and that $\pi_0$ is the distribution of
the random measure $\mu(\cdot, \theta)$ when $\theta$ is distributed according to $p_0$.

The next step is then to estimate $\Phi(\mu^\dagger)$ via conditioning. Let $p_n \in \mathcal{M}(\Theta)$ be the
posterior distribution of $\theta$ given the observation of the i.i.d. samples $d_1, \dots, d_n$, as obtained
using Bayes' formula, and let $\pi_n$ be the push-forward of $p_n$. The Bayesian estimate of
$\Phi(\mu^\dagger)$ is therefore

$\mathbb{E}_{\mu \sim \pi_n}[\Phi(\mu)]$. (2.3)

For the purposes of exposition, we assume that the measures $\mu(\cdot, \theta)$ and $\mu^\dagger$ are all
absolutely continuous with respect to the Lebesgue measure and write $\beta(\cdot, \theta)$ and $\beta^\dagger$
for their densities, which we assume to be continuous. Similarly, we assume that the
measure $p_0$ is absolutely continuous with respect to the Lebesgue measure and, abusing
notation, write $p_0$ for both the measure $p_0$ and its (continuous) density, and similarly for
$p_n(\cdot)$, the posterior density of $\theta$ given the observation of the samples $d_1, \dots, d_n$. We
will now examine the convergence properties of the sequence of posterior densities $p_n(\theta)$
as $n \to \infty$. This analysis being classical (see for instance [n 1] and references therein),
our purpose is not to provide rigorous justifications but rather to familiarize the reader
with the mechanisms regarding the convergence of posteriors.

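We have

$p_n(\theta) = \frac{p_0(\theta) \prod_{i=1}^n \beta(d_i, \theta)}{\int_\Theta p_0(\theta') \prod_{i=1}^n \beta(d_i, \theta') \, d\theta'} = \frac{p_0(\theta) \prod_{i=1}^n \beta(d_i, \theta)}{\mathbb{E}_{p_0}\left[\prod_{i=1}^n \beta(d_i, \cdot)\right]}$, (2.4)

which we write as

$p_n(\theta) = \frac{p_0(\theta) \, e^{n L_n(\theta)}}{\mathbb{E}_{p_0}\left[e^{n L_n}\right]}$, (2.5)

where

$L_n(\theta) := \frac{1}{n} \sum_{i=1}^n \log \beta(d_i, \theta)$. (2.6)

Recall that $\prod_{i=1}^n \beta(d_i, \theta)$ is commonly known as the likelihood and $L_n(\theta)$ as the (sample)
average log-likelihood.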
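
As a rough numerical illustration of (2.4)-(2.6), the following sketch evaluates the posterior on a finite grid of parameter values through the relation $p_n(\theta) \propto p_0(\theta) e^{n L_n(\theta)}$. The Gaussian location model with known unit standard deviation, the uniform prior on a grid, the simulated data, and all function names are assumptions made for this example only; none of them is taken from the paper.

```python
import math
import random


def log_density(x, c, sigma=1.0):
    """Log of a Gaussian density beta(x, theta), with theta = c and sigma assumed known."""
    return -0.5 * ((x - c) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def average_log_likelihood(data, c):
    """Sample average log-likelihood L_n(theta), as in (2.6)."""
    return sum(log_density(d, c) for d in data) / len(data)


def grid_posterior(data, grid, prior):
    """Posterior weights on a finite grid, proportional to p_0(theta) * exp(n * L_n(theta))."""
    n = len(data)
    log_post = [math.log(p0) + n * average_log_likelihood(data, c)
                for c, p0 in zip(grid, prior)]
    shift = max(log_post)  # subtract the maximum before exponentiating, for stability
    unnormalised = [math.exp(lp - shift) for lp in log_post]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


random.seed(0)
data = [random.gauss(0.5, 1.0) for _ in range(2000)]  # hypothetical well-specified truth
grid = [-3.0 + 0.05 * k for k in range(121)]          # candidate values of c
prior = [1.0 / len(grid)] * len(grid)                  # uniform prior weights on the grid
posterior = grid_posterior(data, grid, prior)
posterior_mode = grid[posterior.index(max(posterior))]
```

In this well-specified toy setting the posterior weights concentrate near the mean used to simulate the data; the normalising sum plays the role of the denominator in (2.5).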

Consistency and the large-sample limit. Now observe that if $\log \beta(d_i, \theta)$ is
integrable then it follows from the Law of Large Numbers that $L_n(\theta)$ converges almost
surely, as $n \to \infty$, to the expected log-likelihood $L(\theta)$ defined by

$L(\theta) := \int_{\mathcal{X}} \beta^\dagger(x) \log(\beta(x, \theta)) \, dx$. (2.7)

Assuming that $L(\theta)$ has a unique maximizer $\theta^* \in \Theta$, known as the maximum likelihood
estimator (MLE), and that $p_0$ is strictly positive in every neighborhood of $\theta^*$, it follows
under assumptions on the regularity of $\beta$ (or local strict convexity in the neighborhood
of $\theta^*$) that $p_n(\theta)$ converges, almost surely, as $n \to \infty$, towards a Dirac mass supported
at $\theta^*$. Therefore, assuming $\Phi$ to be sufficiently regular, the Bayesian posterior estimate
of $\Phi(\mu^\dagger)$, i.e.,

$\int_\Theta \Phi(\mu(\cdot, \theta)) \, p_n(\theta) \, d\theta$ (2.8)

converges almost surely as $n \to \infty$ to

$\Phi(\mu(\cdot, \theta^*))$. (2.9)

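Note that

$L(\theta) = -\operatorname{Ent}(\beta^\dagger) - D_{\mathrm{KL}}(\beta^\dagger \| \beta(\cdot, \theta))$,

where $\operatorname{Ent}(\beta^\dagger) := -\int_{\mathcal{X}} \beta^\dagger(x) \log \beta^\dagger(x) \, dx$ is the entropy of $\beta^\dagger$ and $D_{\mathrm{KL}}$ denotes the
Kullback-Leibler divergence defined by

$D_{\mathrm{KL}}(\beta^\dagger \| \beta(\cdot, \theta)) := \mathbb{E}_{X \sim \mu^\dagger}\left[\log \frac{\beta^\dagger(X)}{\beta(X, \theta)}\right]$. (2.10)

It follows that $\theta^*$ is also the minimizer of $D_{\mathrm{KL}}(\beta^\dagger \| \beta(\cdot, \theta))$ with respect to $\theta$, i.e. the
MLE $\theta^*$ is characterized by the property that $\mu(\cdot, \theta^*)$ is the distribution having minimal
relative entropy to $\mu^\dagger$ in the model class $\{\mu(\cdot, \theta)\}_{\theta \in \Theta}$.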
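
The link between the expected log-likelihood and relative entropy can be checked numerically in a discrete analogue, where integrals become finite sums. With the entropy defined as $\operatorname{Ent}(p) := -\sum_i p_i \log p_i$, the identity reads $\sum_i p_i \log q_i = -\operatorname{Ent}(p) - D_{\mathrm{KL}}(p \| q)$, so maximizing the expected log-likelihood over a model class is the same as minimizing the Kullback-Leibler divergence. The three-point distributions and the labels of the candidate models below are hypothetical and serve only as a sketch.

```python
import math


def entropy(p):
    """Entropy -sum p_i log p_i of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def expected_log_likelihood(p, q):
    """Discrete analogue of (2.7): expectation under p of log q."""
    return sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


truth = [0.5, 0.3, 0.2]  # hypothetical data-generating distribution
models = {               # hypothetical finite model class
    "a": [0.4, 0.4, 0.2],
    "b": [0.5, 0.3, 0.2],
    "c": [0.1, 0.1, 0.8],
}

best_by_likelihood = max(models, key=lambda k: expected_log_likelihood(truth, models[k]))
best_by_kl = min(models, key=lambda k: kl_divergence(truth, models[k]))
```

Both selection rules pick the same candidate, here the one that coincides with the truth, which mirrors the well-specified case discussed in the text.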

An immediate consequence of this observation is the fact that if the model is not
misspecified, i.e. if $\mu^\dagger$ is an element $\mu(\cdot, \theta^\dagger)$ of the model class, then $\theta^* = \theta^\dagger$, $\mu(\cdot, \theta^*) = \mu^\dagger$,
and the Bayesian estimate (2.8) is asymptotically exact in the limit as $n \to \infty$. In this
situation, the Bayesian estimate is said to be consistent.

This convergence result is known as the Bernstein-von Mises Theorem (see for in- 
stance [84, Theorem 5]) or as the Bayesian Central Limit Theorem, since the limiting 
posterior can even be described in a more refined way as being asymptotically normal 
and not just a point mass. The condition that every open neighbourhood of $\theta^\dagger$ has
strictly positive $p_0$-probability is known informally as Cromwell's Rule^1.

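What happens when the model is misspecified? To provide an illustrative answer
to this question, consider the family of Gaussian models $\{\beta(\cdot, \theta) \mid \theta = (c, \sigma) \in \mathbb{R} \times \mathbb{R}_+\}$,
where

$\beta(x, c, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - c)^2}{2\sigma^2}\right)$.

What will happen when this model is exposed to data coming from a potentially
non-Gaussian truth $\mu^\dagger$, with density $\beta^\dagger$, that has a well-defined mean $c^\dagger$ and standard
deviation $\sigma^\dagger$? By the above considerations, $\theta^*$ maximizes the expected log-likelihood (2.7)
with respect to $\theta$, and the expected log-likelihood is simply

$L(\theta) = -\int_{\mathbb{R}} \beta^\dagger(x) \frac{(x - c)^2}{2\sigma^2} \, dx - (\log \sigma) \int_{\mathbb{R}} \beta^\dagger(x) \, dx - \log \sqrt{2\pi}$. (2.11)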



^1 Since the posterior cannot possibly concentrate on a point outside the support of the prior, having
a globally-supported prior and hence not ruling out a priori any $\theta \in \Theta$ as a possible $\theta^\dagger$ can be seen as
a Bayesian version of Oliver Cromwell's famous injunction to the Synod of the Church of Scotland in
1650: "I beseech you, in the bowels of Christ, think it possible that you may be mistaken."



[Figure: schematic of the model class $\mathcal{A}_0 = \{\mu(\cdot, \theta) \mid \theta \in \Theta\}$ as a subset of $\mathcal{M}(\mathcal{X})$.]

Figure 2.1: According to the brittleness theorems 5.12, 6.4 and 6.10, the Bayesian model
$\{\mu(\cdot, \theta)\}_{\theta \in \Theta}$ may be arbitrarily close to the (true) data-generating distribution $\mu^\dagger$ (in
Prokhorov and/or total variation metrics or in terms of the number of finite-dimensional
marginals of the data-generating distribution that are exactly captured) and still make
the largest possible prediction error after conditioning on an arbitrarily large number of
sample data. Note that, for complex systems, the data-generating distribution
is a point in the infinite-dimensional space of measures $\mathcal{M}(\mathcal{X})$ whereas the Bayesian
model is a finite-dimensional subspace of $\mathcal{M}(\mathcal{X})$. Note also that if the truth $\mu^\dagger$ and the
model class are not mutually absolutely continuous, then the relative entropy distance
from $\mu^\dagger$ to the model class will be infinite, and this is generic in infinite-dimensional /
non-parametric settings.



A quick calculation using partial derivatives shows that $\theta^* = (c^*, \sigma^*)$ maximizes (2.11)
if and only if $c^* = c^\dagger$ and $\sigma^* = \sigma^\dagger$. That is, the Bayesian estimate (2.3) of $\Phi(\mu^\dagger)$, for any
distribution $\mu^\dagger$ of mean $c^\dagger$ and standard deviation $\sigma^\dagger$, converges almost surely as the
number of sample data goes to infinity, towards $\Phi(\mu(\cdot, (c^\dagger, \sigma^\dagger)))$, where $\mu(\cdot, (c^\dagger, \sigma^\dagger))$ is
the unique Gaussian distribution on $\mathbb{R}$ with mean $c^\dagger$ and standard deviation $\sigma^\dagger$.
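
For completeness, the quick calculation can be spelled out as follows. This is a sketch that uses only the normalization of $\beta^\dagger$ and the definitions of its mean $c^\dagger$ and standard deviation $\sigma^\dagger$, which together give $\int_{\mathbb{R}} \beta^\dagger(x) (x - c)^2 \, dx = (\sigma^\dagger)^2 + (c^\dagger - c)^2$.

```latex
\begin{align*}
L(c, \sigma)
  &= -\frac{(\sigma^\dagger)^2 + (c^\dagger - c)^2}{2\sigma^2} - \log \sigma - \log \sqrt{2\pi}, \\
\frac{\partial L}{\partial c}(c, \sigma)
  &= \frac{c^\dagger - c}{\sigma^2}, \\
\frac{\partial L}{\partial \sigma}(c, \sigma)
  &= \frac{(\sigma^\dagger)^2 + (c^\dagger - c)^2 - \sigma^2}{\sigma^3}.
\end{align*}
```

The first derivative vanishes only at $c = c^\dagger$, and at that value the second vanishes only at $\sigma = \sigma^\dagger$. For fixed $\sigma$ the map $c \mapsto L(c, \sigma)$ is a concave quadratic, and $\sigma \mapsto L(c^\dagger, \sigma)$ is increasing for $\sigma < \sigma^\dagger$ and decreasing for $\sigma > \sigma^\dagger$, so the critical point is the unique global maximizer.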

However, now there is a problem: there are many different probability distributions
$\mu$ on $\mathbb{R}$ that have the same first and second moments as $\mu^\dagger$ but have, say, different
higher-order moments, or different quantiles. Predictions of those other moments or quantiles
using $\mu(\cdot, (c^\dagger, \sigma^\dagger))$ can be inaccurate by orders of magnitude. A trivial, albeit extreme,
example is furnished by $\Phi(\mu) := \mathbb{P}_\mu[|X - c_\mu| \geq t\sigma_\mu]$ (where $c_\mu$ and $\sigma_\mu$ denote the mean
and standard deviation of $\mu$). Under the Gaussian model,

$\mathbb{P}[|X - c_\mu| \geq t\sigma_\mu] = 1 + \operatorname{erf}\left(-\frac{t}{\sqrt{2}}\right)$,

whereas the extreme cases that prove the sharpness of Chebyshev's (Markov's optimal)
inequality have

$\mathbb{P}[|X - c_\mu| \geq t\sigma_\mu] = \min\left\{1, \frac{1}{t^2}\right\}$.

In the case of the archetypically rare "$6\sigma$ event", the ratio between the two is
approximately $1.4 \times 10^7$. This comparison is, of course, an almost perversely extreme comparison:
it would be obvious to any observer with only moderate amounts of sample data that
the data were being drawn from a highly non-Gaussian distribution. However, it is not
inconceivable that the true distribution $\mu^\dagger$ has a Gaussian-looking bulk but tails that
are significantly fatter than those of a Gaussian, and the difference may be difficult to
establish using reasonable amounts of sample data; yet, it is those tails that drive the
occurrence of "Black Swans", catastrophically high-impact but low-probability outcomes.
The results of this paper show that this situation is generic, and cannot be avoided no
matter how many moments or integrals of arbitrary test functions of the truth $\mu^\dagger$ are
matched nor how "close" $\mu^\dagger$ is to the class $\{\mu(\cdot, \theta)\}_{\theta \in \Theta}$.
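
The size of the gap between the Gaussian tail and the extreme Chebyshev case can be checked directly. The sketch below uses only the Python standard library; the function names are illustrative, and the complementary error function `math.erfc` is used because $1 + \operatorname{erf}(-t/\sqrt{2}) = \operatorname{erfc}(t/\sqrt{2})$ and the latter is numerically more accurate for small tail probabilities.

```python
import math


def gaussian_two_sided_tail(t):
    """P[|X - c| >= t * sigma] for Gaussian X, i.e. 1 + erf(-t / sqrt(2))."""
    return math.erfc(t / math.sqrt(2.0))


def chebyshev_extreme_tail(t):
    """Largest tail probability compatible with a given mean and standard deviation."""
    return min(1.0, 1.0 / t ** 2)


t = 6.0
ratio = chebyshev_extreme_tail(t) / gaussian_two_sided_tail(t)
print(f"ratio for t = {t}: {ratio:.3g}")
```

For $t = 6$ the ratio is of order $10^7$, in line with the figure of approximately $1.4 \times 10^7$ for the "$6\sigma$ event".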

2.1 Bayesian Inconsistency and Model Misspecification 

To quote [ - ], "[w]hile for a Bayesian statistician the analysis ends in a certain sense with
the posterior, one can ask interesting questions about the properties of posterior-based
inference from a frequentist point of view." Many of these questions are asymptotic
in nature: for example, in the limit of infinitely many independent $\mu^\dagger$-distributed samples,
will the posterior converge in a suitable sense to $\mu^\dagger$ regardless of the initial choice
of prior $\pi$? This property is referred to as consistency^2; a general survey of consistency
results is found in [ ' ' ']. As noted above, the consistency theorem is generically known
as the Bernstein-von Mises theorem [24, 110], although the earliest rigorous proofs are
due to Doob [ ] and Le Cam [76].

^2 Sometimes the term frequentist consistency is used, reflecting the fact that it lies outside the strict
Bayesian worldview, and, to a fundamentalist Bayesian, even to say that the data are generated by $\mu^\dagger$
is a frequentist heresy.

Unfortunately, Cromwell's Rule is only necessary, and not sufficient, to ensure
consistency. In fact, consistency is far from being a generic property, and once the
probability space contains infinitely many points (and hence any parameter space $\Theta$ that
parametrizes all probability measures on that probability space is infinite-dimensional),
inconsistency is not the exception, but the rule [ ]. In [ , Sec. 5], Freedman considered
a countable index set $\mathbb{N} := \{1, 2, \dots\}$ and the parameter space

$\Theta := \left\{ \theta \colon \mathbb{N} \to [0, 1] \;\middle|\; \sum_{i \in \mathbb{N}} \theta(i) = 1 \right\}$.

Each $\theta$ gives rise to a probability distribution $\mathbb{P}_\theta = \mu(\cdot, \theta)$ under which the observations
$X_1, X_2, \dots$ are IID with $\mathbb{P}_\theta[X_j = i] = \theta(i)$. The problem is assumed to be well-specified,
so that one particular $\theta^\dagger \in \Theta$ is considered to be the "true" parameter value, and the
frequentist data-generating distribution is $\mu^\dagger = \mathbb{P}_{\theta^\dagger} = \mu(\cdot, \theta^\dagger)$. Theorem 5 of [ ] shows
that, when $\operatorname{supp}(\theta^\dagger)$ is infinite, given any "spurious" probability distribution $q = \mathbb{P}_\theta$,
there exists a prior probability measure $\pi$ on $\Theta$ that has $\theta^\dagger$ in its support, such that
the posterior of $\pi$ $\mu^\dagger$-a.s. concentrates on $q$ in the limit of observing infinitely many
i.i.d. $\mu^\dagger$-distributed samples. In fact, there is a prior that gives positive mass to every
open subset of $\Theta$ but yields consistent posterior estimates for only a first-category set of
possible "true" (data-generating) parameter values $\theta^\dagger$.

There are conditions on priors that do ensure consistency in infinite-dimensional or 
non-parametric contexts, e.g. the tail-free priors introduced by Freedman in ["> '] and 
hybrid Bayesian-frequentist tools such as Dirichlet process priors [ou]. However, while 
the collection of "bad" priors that lead to inconsistent results is measure-theoretically 
small [44, 33], it is topologically generic [55]. 

It is important to appreciate that the requirement of positive prior mass in every
neighborhood of the true distribution depends upon the topology placed upon $\mathcal{M}(\mathcal{X})$.
For example, Schwartz [96] showed that every $\pi$ that puts positive mass on all
Kullback-Leibler (relative entropy) neighborhoods of $\mu^\dagger$ is weakly consistent. On the other hand,
Freedman ["> '] and Diaconis & Freedman [\'l] show that $\pi$ may put positive mass on
all weak neighborhoods of $\mu^\dagger$ and still fail to be weakly consistent (e.g. by not being
tail-free). Nor are results limited to weak convergence of the posterior to $\mu^\dagger$. For
example, [10] shows that consistency holds in the Hellinger distance if $\pi$ puts positive
mass on all Kullback-Leibler neighborhoods of $\mu^\dagger$ and certain smoothness and tail
conditions are satisfied; see [112, 115] for further results on Hellinger and Kullback-Leibler
consistency. The amount of prior probability mass that lies Kullback-Leibler-close to
the truth, quantified using a notion called thickness, can be used to quantify the
convergence properties of Bayes estimates [1, 2, 79]. However, in the infinite-dimensional
contexts that are increasingly subject to Bayesian analyses, it is important to note that
probability measures are "usually" mutually singular and "rarely" mutually absolutely
continuous, and so the Kullback-Leibler neighbourhoods of $\mu^\dagger$ are small sets that are
"unlikely" to intersect the model class.




The situation in which there is no $\theta^\dagger \in \Theta$ such that $\mu^\dagger = \mu(\cdot, \theta^\dagger)$ is referred to as
model misspecification. The consistency and other asymptotic properties of misspecified
models appear to have first been considered by Berk [21, 22] and Huber [ ]. See [73, 74]
for a recent contribution, and [ ] for convergence rates.

"In practice, Bayesian inference is employed under misspecification all the 
time, particularly so in machine learning applications. While sometimes it 
works quite well under misspecification [2(3, 73], there are also cases where 
it does not [3-'5, 58], so it seems important to determine precise conditions 
under which misspecification is harmful — even if such an analysis is based 
on frequentist assumptions." 

There is a reasonable popular belief that gross misspecification of the model will be 
detected by some means before engaging in a serious Bayesian analysis; indeed there do 
exist tests [63, 119] for model misspecification, but it is important to note that while one 
can determine that the model is misspecified, one cannot be sure that the model is
well-specified. There is also an understandable popular belief that these tests mean that one
need only be concerned with the situation of "mild misspecification" , and that provided 
$\mu^\dagger$ lies "close enough" to the model class $\{\mu(\cdot, \theta)\}_{\theta \in \Theta}$ (as illustrated in schematic form
in Figure 2.1), the posterior estimates will still converge to a usefully informative limit. 
Simply put, one aim of this paper is to show that this belief is wrong. 

2.2 Bayesian Robustness 

"Most statisticians would acknowledge that an analysis is not complete unless 
the sensitivity of the conclusions to the assumptions is investigated. Yet, in 
practice, such sensitivity analyses are rarely used. This is because sensitivity 
analyses involve difficult computations that must often be tailored to the 
specific problem. This is especially true in Bayesian inference where the 
computations are already quite difficult." [116] 

One response to the concern that the choice of prior (and likelihood) is somewhat 
arbitrary is to perform Bayesian analysis over classes of priors (and likelihoods): this 
approach is known as robust Bayesian inference. The robust Bayesian viewpoint appears 
to have been introduced independently by Box [29] and Huber [66]; see e.g. [19, 20] 
and Chapter 15 of [ ] for surveys of the field. In the robust Bayesian approach, a 
class $\Pi$ of priors and a class $\Lambda$ of likelihoods together produce a class of posteriors by
pairwise combination through Bayes' rule. Robust Bayesian methods are a subclass of 
the methods of imprecise probability; the idea that the probability of an event need not 
be a single real number has a history stretching back to Boole [28] and Keynes [72] , with 
more recent and comprehensive foundations laid out in e.g. [75, 114, 118]. 

One way of generating such a class $\Pi$ of priors is via a belief function, as in [117] and
Dempster-Shafer theory more generally. The belief function framework encompasses
prior probabilities whose values are known only on some finite partition of the
probability space, and not the whole $\sigma$-algebra; classes of $\varepsilon$-contaminated priors can also be
represented in this way, as well as classes of locally perturbed priors. The belief function
approach has the useful feature that explicit formulae can be given for the lower and 
upper posterior probabilities of events [117, Theorem 4.1]. 

Another typical approach to generating a class $\Pi$ might be to consider a
finite-dimensional parametrized class of models. For example, one could consider, instead
of a single Gaussian prior on $\mathbb{R}$ of specified mean and variance, a two-parameter class
of Gaussian priors with a range of means and variances, or a three-parameter class 
of skew-Gaussian priors. Similarly, one might consider a two-parameter class of beta 
distributions instead of a uniform prior on a bounded interval. 

However, a danger in specifying a finite-dimensional class $\Pi$ of priors is that one is
making very strong statements about the form of the priors, particularly with regard to
the tails, that cannot be justified based on often-limited amounts of prior information.
For example, if all the priors $\pi \in \Pi$ have thin tails, then the class $\Pi$ will have a very
difficult time modelling events that lie in those tails, even when exposed to data from those
regions. This problem is particularly important in applied fields such as catastrophe
modelling, insurance, and re-insurance, in which the catastrophic events of interest are
by definition high-impact low-probability "Black Swan" events: the difference between
an exponentially small and an inverse-polynomially small tail can be vitally important.
Also, because members of a finite-dimensional parametric family $\Pi$ of priors often have
similar qualitative properties (such as being mutually absolutely continuous), the
apparently broader perspective does not add much to the asymptotic posterior picture
in terms of robust consistency, although it does provide a broader understanding given
finitely many samples.

Rather than specifying a finite-dimensional $\Pi$, it is epistemologically more reasonable
to specify a finite-codimensional $\Pi$, for example by specifying interval bounds on the
expected values of finitely many observed test functions (i.e. generalized moment
inequalities); this setting encompasses the finite-partition belief function framework mentioned
above. Calculation of optimal prior and posterior bounds on quantities of interest is often
an exercise in numerical optimization [25, 86, 98] rather than closed-form formulae.

2.3 Purpose 

In terms of the above discussion, one purpose of this paper is to explore the extent 
to which one can simultaneously have robust Bayesian analyses that produce consis- 
tent answers, given that the models used (both priors and likelihoods) are certain to 
be misspecified to some degree. Can one be "just a little bit wrong" in terms of model 
misspecification? Our results suggest that the answer is strongly negative when "close- 
ness" is measured in total variation and Prokhorov metrics: either one's robust Bayesian 
model is well-specified, in which case there is robust consistency; or else the model is
misspecified, even slightly, and the limiting posterior bounds are no more informative
than "$L^\infty$-type" worst- and best-case bounds.
3 Incorporation of Bayesian Priors in the OUQ framework

3.1 A quick reminder on the OUQ framework 

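Let $\mathcal{X}$ be a topological space, let $\mathcal{F}(\mathcal{X})$ be the space of real-valued measurable functions
on $\mathcal{X}$, and let $\mathcal{G} \subseteq \mathcal{F}(\mathcal{X})$ be a subset. Let $\mathcal{A}$ be an arbitrary subset of $\mathcal{G} \times \mathcal{M}(\mathcal{X})$,
and let $\Phi \colon \mathcal{G} \times \mathcal{M}(\mathcal{X}) \to \mathbb{R}$ be a function producing a quantity of interest. In the
context of uncertainty quantification one is interested in estimating $\Phi(f^\dagger, \mu^\dagger)$, where
$(f^\dagger, \mu^\dagger) \in \mathcal{G} \times \mathcal{M}(\mathcal{X})$ corresponds to an unknown reality: the function $f^\dagger$ represents a
response function of interest, and $\mu^\dagger$ represents the probability distribution of the inputs
of $f^\dagger$. If $\mathcal{A}$ represents all that is known about $(f^\dagger, \mu^\dagger)$ (in the sense that $(f^\dagger, \mu^\dagger) \in \mathcal{A}$
and that any $(f, \mu) \in \mathcal{A}$ could, a priori, be $(f^\dagger, \mu^\dagger)$ given the available information) then
[86] shows that the quantities

$\mathcal{U}(\mathcal{A}) := \sup_{(f, \mu) \in \mathcal{A}} \Phi(f, \mu)$ (3.1)

$\mathcal{L}(\mathcal{A}) := \inf_{(f, \mu) \in \mathcal{A}} \Phi(f, \mu)$ (3.2)

determine the inequality

$\mathcal{L}(\mathcal{A}) \leq \Phi(f^\dagger, \mu^\dagger) \leq \mathcal{U}(\mathcal{A})$, (3.3)

to be optimal given the available information $(f^\dagger, \mu^\dagger) \in \mathcal{A}$ as follows: It is simple to see
that the inequality (3.3) follows from the assumption that $(f^\dagger, \mu^\dagger) \in \mathcal{A}$. Moreover, for
any $\varepsilon > 0$ there exists a $(f, \mu) \in \mathcal{A}$ such that

$\mathcal{U}(\mathcal{A}) - \varepsilon < \Phi(f, \mu) \leq \mathcal{U}(\mathcal{A})$.

Consequently since all that we know about $(f^\dagger, \mu^\dagger)$ is that $(f^\dagger, \mu^\dagger) \in \mathcal{A}$, it follows that
the upper bound $\Phi(f^\dagger, \mu^\dagger) \leq \mathcal{U}(\mathcal{A})$ is the best obtainable given that information, and
the lower bound is optimal in the same sense.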

Although the OUQ optimization problems (3.1) and (3.2) are extremely large, we
have shown in [86] that an important subclass enjoys significant and practical
finite-dimensional reduction properties. First, by [86, Cor. 4.4], although the optimization
variables $(f, \mu)$ lie in a product space of functions and probability measures, for OUQ
problems governed by linear inequality constraints on generalized moments, the search
can be reduced to one over probability measures that are products of finite convex
combinations of Dirac masses with explicit upper bounds on the number of Dirac masses.

Furthermore, in the special case that all constraints are generalized moments of
functions of $f$, the dependency on the coordinate positions of the Dirac masses is eliminated
by observing that the search over admissible functions reduces to a search over functions
on an $m$-fold product of finite discrete spaces, and the search over $m$-fold products of
finite convex combinations of Dirac masses reduces to a search over the products of
probability measures on this $m$-fold product of finite discrete spaces [86, Thm. 4.7]. Finally,
by [86, Thm. 4.9], using the lattice structure of the space of functions, the search over
these functions can be reduced to a search over a finite set.

Example 3.1. A classic example is $\Phi(f, \mu) := \mu[f \geq a]$ where $a$ is a safety margin.
In the certification context one is interested in showing that $\mu^\dagger[f^\dagger \geq a] \leq \epsilon$, where
$\epsilon$ is a safety certification threshold (i.e. the maximum acceptable $\mu^\dagger$-probability of the
system $f^\dagger$ exceeding the safety margin $a$). If $\mathcal{U}(\mathcal{A}) \leq \epsilon$, then the system associated with
$(f^\dagger, \mu^\dagger)$ is safe even in the worst case scenario (given the information represented by $\mathcal{A}$).
If $\mathcal{L}(\mathcal{A}) > \epsilon$, then the system associated with $(f^\dagger, \mu^\dagger)$ is unsafe even in the best case
scenario (given the information represented by $\mathcal{A}$). If $\mathcal{L}(\mathcal{A}) \leq \epsilon < \mathcal{U}(\mathcal{A})$, then the safety
of the system cannot be decided (although we could declare the system to be unsafe due
to lack of information).
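The three cases of Example 3.1 can be sketched as a small decision rule. This is only an illustration, assuming that the optimal bounds have already been computed elsewhere and that the comparisons are read as non-strict on the safe side (upper bound at most the threshold); the function name `certify` and its argument names are hypothetical and do not come from the paper.

```python
def certify(lower: float, upper: float, eps: float) -> str:
    """Classify a system from optimal bounds lower <= P[f >= a] <= upper.

    lower and upper play the roles of L(A) and U(A); eps is the
    certification threshold. All names here are illustrative assumptions.
    """
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound")
    if upper <= eps:
        return "safe"       # safe even in the worst case
    if lower > eps:
        return "unsafe"     # unsafe even in the best case
    return "undecided"      # lower <= eps < upper: information is insufficient


if __name__ == "__main__":
    print(certify(0.00, 0.01, 0.05))
    print(certify(0.10, 0.20, 0.05))
    print(certify(0.01, 0.20, 0.05))
```

The three branches are mutually exclusive and exhaustive, which is what makes the undecided case a statement about the information set rather than about the system itself.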

3.2 Bayesian priors on spaces of measures and functions 

In the OUQ setting, an assumption of the form

$$(f^\dagger, \mu^\dagger) \in \mathcal{A}$$

in terms of an assumption set $\mathcal{A} \subseteq \mathcal{G} \times \mathcal{M}(\mathcal{X})$, where $\mathcal{G} \subseteq \mathcal{F}(\mathcal{X})$, was used to derive
the optimal inequality (3.3). This paper will consider the situation in which one has
priors on the admissible set $\mathcal{A}$ and also information in the form of sample data. One of
our goals is to analyse the robustness (or brittleness) of Bayesian inference by obtaining
optimal bounds on posterior values given local misspecifications.

In order to define priors on the space of admissible scenarios, $\mathcal{A}$ needs to be given
the structure of a measurable space; i.e. a suitable $\sigma$-algebra $\Sigma_{\mathcal{A}}$ on $\mathcal{A}$ must be provided.
When this is accomplished we will refer to a probability measure $\pi \in \mathcal{M}(\Sigma_{\mathcal{A}})$ as a
prior. However, this is a non-trivial task because if the $\sigma$-algebra on $\mathcal{A}$ is too small then
natural functions of interest $\Phi$ may not be measurable, and if it is too large then the set
of probability measures on $\mathcal{A}$ may be empty or too small. Moreover, it would also be
convenient if we could easily apply the reduction theorems of [86].

Section 7 will show that Polish (i.e. separable and completely metrizable) spaces
provide a natural setting for our work. In particular, we develop simple conditions on the
function space $\mathcal{G}$ and base space $\mathcal{X}$ for which: (1) the reduction theorems of [86] apply
when $\mathcal{A}$ is any Borel subset of the product space $\mathcal{G} \times \mathcal{M}(\mathcal{X})$, when $\mathcal{M}(\mathcal{X})$ is endowed
with the weak-* topology; and (2) the classic object of interest $\Phi\colon \mathcal{G} \times \mathcal{M}(\mathcal{X}) \to \mathbb{R}$
defined by $\Phi(f, \mu) := \mu[f \geq a]$ is measurable. As stated in Theorem 7.5, the conditions
are that $\mathcal{X}$ be Polish, $\mathcal{G}$ be Polish, and the evaluation function $\imath_x\colon \mathcal{G} \to \mathbb{R}$ defined by
$\imath_x(g) := g(x)$ be Borel measurable for each $x \in \mathcal{X}$. Moreover, we show that many
function spaces satisfy these conditions: Theorem 7.10 shows that a Reproducing Kernel
Hilbert Space (RKHS) or Reproducing Kernel Banach Space (RKBS) of functions over
$\mathcal{X}$ with measurable feature map(s) satisfies these criteria; Theorem 7.12 asserts that the
space of upper semicontinuous functions with the Wijsman topology also satisfies these
conditions. In addition, many of these function spaces are known to be very expressive.
For example, Steinwart [100] introduced universal kernels on compact domains as those
whose RKHSs can approximate any continuous function uniformly, and demonstrated
that many of the existing popular kernels, in particular the Gaussian kernels,
are universal. For non-compact $\mathcal{X}$, Steinwart, Hush and Scovel [103] provide conditions
on the kernel that guarantee approximation properties in $L^p$ spaces. For a thorough
discussion of these matters in the context of Learning Theory, see [101].

In the process of establishing these results, we have obtained a new result which,
in addition to being very useful to us, may be important for Learning Theory: on a
bianalytic space (see Frolik [ ] for the definition and the proof that a Polish space is
bianalytic), Lemma 7.9 implies that an RKHS or RKBS with measurable primary feature
map is separable.

Remark 3.2. The desire to have the Borel measurable structure of a Polish space might
seem to be a spurious level of abstraction, but there are many good reasons for it. The
first is that, by Suslin's Theorem [71, Thm. 14.2], all Borel subsets of a Polish space
are Suslin, where a Suslin space is a continuous Hausdorff image of a Polish space.
Indeed, Suslin sets are important in measurable selection theorems (see e.g. [ :]) such
as those that we use in the proof of Lemma 4.10; furthermore, in addition to Ulam's
theorem [7, Thm. 4.3.8] that all probability measures on a Polish space are regular
(approximable from within by compact sets), Schwartz' theorem [.m] implies that
all probability measures on a Suslin space are regular, and, therefore, [108, Thm. 11.1]
implies that the extreme points in the space of probability measures on a Suslin space
are the Dirac measures. Consequently, when the product $\mathcal{G} \times \mathcal{M}(\mathcal{X})$ is Polish, any Borel
subset $\mathcal{A} \subseteq \mathcal{G} \times \mathcal{M}(\mathcal{X})$ is Suslin and so the extreme points of probability measures on $\mathcal{A}$
are the Dirac measures, and some powerful measurable selection theorems are available.
Moreover, when the base space is metrizable, then the space of probability measures is
Polish in the weak-* topology if and only if the base space is Polish.

Furthermore, since separability is equivalent to second countability for metric spaces,
we have that the Borel structure of a product is the product of Borel structures of
Polish spaces. In addition, by [4S, Thm. 10.2.2], regular conditional probabilities exist
for observables with values in a Polish space. Moreover, in some sense, there is only one
Polish measurable space by a construction of Skorokhod [(iO]. Also, Polish spaces are the
spaces of Descriptive Set Theory, see e.g. Kechris [71], and fundamental to our results,
see Lemma 7.2, will be a surprising result of Kechris. Finally, Polish spaces appear to
be the appropriate spaces to play topological games such as the Banach-Mazur game
[87], the Sierpiński game, the Ulam game, the Banach game, and the Choquet game.
Moreover, Choquet's theorem [71, Thm. 8.18] says that a separable metric space is
completely metrizable (and hence Polish) if and only if the second player has a winning
strategy in the strong Choquet game. For a review of topological games, see Telgarsky's
review [107], and for topological games in hyperspace see that of Zsilinszky [124].

3.3 Data Spaces and Maps 

Henceforth $\mathcal{A}$ will be a topological space. In practice the response function $f^\dagger$ and
the probability measure $\mu^\dagger$ are not directly observed and the sample data arrives in
the form of (realizations of) observation random variables, the distribution of which
is related to $(f^\dagger, \mu^\dagger)$. To simplify the current presentation we will assume that this
relation is determined by a function of $(f^\dagger, \mu^\dagger)$, such as the case where the data
$(X_1, f^\dagger(X_1)), \ldots, (X_n, f^\dagger(X_n))$ are determined by $n$ independent realizations $X_i$ of the
random variable $X$ determined by the possibly unknown distribution $\mu^\dagger$. Throughout
this paper we will use the following notation: $\mathcal{D}$ will denote the observable space (i.e.
the space in which the sample data take values); $\mathcal{D}$ will be assumed to be a metrizable
Suslin space and $D$ will denote a $\mathcal{D}$-valued random variable producing the observed
sample data. To represent the dependence of the observation random variable $D$ on the
unknown state $(f^\dagger, \mu^\dagger) \in \mathcal{A}$ we introduce a measurable function

$$\mathbb{D}\colon \mathcal{A} \to \mathcal{M}(\mathcal{D}),$$

where $\mathcal{M}(\mathcal{D})$ is given the Borel structure corresponding to the weak-* topology, to define
this relation. The idea is that $\mathbb{D}(f, \mu)$ is the probability distribution of the observed
sample data $D(f, \mu)$ if $(f^\dagger, \mu^\dagger) = (f, \mu)$, and for this reason it may be called the data
map or, even more loosely, the observation operator. Often, for simplicity, we
will write $D$ instead of $D(f, \mu)$. For simplicity and clarity, we save for Section 3.5 the
consideration of the case where the sample process $\mathbb{D}(f, \mu)$ has uncertainties with $(f, \mu)$
known.

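We proceed with a natural generalization of the Campbell measure and Palm
distribution associated with a random measure as described in [70] (see also [38, Ch. 13] for
a more current treatment). To that end, observe that since $\mathcal{D}$ is metrizable, it follows
from [4, Thm. 15.13] that, for any $B \in \mathcal{B}(\mathcal{D})$, the evaluation $\nu \mapsto \nu(B)$, $\nu \in \mathcal{M}(\mathcal{D})$, is
measurable. Consequently, the measurability of $\mathbb{D}$ implies that the mapping

$$\mathbb{D}\colon \mathcal{A} \times \mathcal{B}(\mathcal{D}) \to \mathbb{R}$$

defined by

$$\mathbb{D}((f, \mu), B) := \mathbb{D}(f, \mu)[B], \quad \text{for } (f, \mu) \in \mathcal{A},\ B \in \mathcal{B}(\mathcal{D}),$$

is a transition function in the sense that, for fixed $(f, \mu) \in \mathcal{A}$, $\mathbb{D}((f, \mu), \cdot\,)$ is a probability
measure, and, for fixed $B \in \mathcal{B}(\mathcal{D})$, $\mathbb{D}(\,\cdot\,, B)$ is Borel measurable. Therefore, by [27,
Thm. 10.7.2], any $\pi \in \mathcal{M}(\mathcal{A})$ defines a probability measure

$$\pi \odot \mathbb{D} \in \mathcal{M}\big(\mathcal{B}(\mathcal{A}) \times \mathcal{B}(\mathcal{D})\big)$$

through

$$\pi \odot \mathbb{D}[A \times B] := \mathbb{E}_{(f,\mu) \sim \pi}\big[\mathbb{1}_A(f, \mu)\, \mathbb{D}(f, \mu)[B]\big], \quad \text{for } A \in \mathcal{B}(\mathcal{A}),\ B \in \mathcal{B}(\mathcal{D}), \qquad (3.4)$$

where $\mathbb{1}_A$ is the indicator function of the set $A$:

$$\mathbb{1}_A(f, \mu) := \begin{cases} 1, & \text{if } (f, \mu) \in A, \\ 0, & \text{if } (f, \mu) \notin A. \end{cases}$$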

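It is easy to see that $\pi$ is the $\mathcal{A}$-marginal of $\pi \odot \mathbb{D}$. Moreover, when $\mathcal{X}$ is Polish,
[4, Thm. 15.15] implies that $\mathcal{M}(\mathcal{X})$ is Polish, and when $\mathcal{G}$ is Polish it follows that
$\mathcal{A} \subseteq \mathcal{G} \times \mathcal{M}(\mathcal{X})$ is second countable. Consequently, since $\mathcal{D}$ is Suslin and hence second
countable, it follows from [ , Prop. 4.1.7] that

$$\mathcal{B}(\mathcal{A} \times \mathcal{D}) = \mathcal{B}(\mathcal{A}) \times \mathcal{B}(\mathcal{D})$$

and hence $\pi \odot \mathbb{D}$ is a probability measure on $\mathcal{A} \times \mathcal{D}$. That is,

$$\pi \odot \mathbb{D} \in \mathcal{M}(\mathcal{A} \times \mathcal{D}).$$

Let us refer to an element of $\mathcal{M}(\mathcal{A})$ as a prior on $\mathcal{A}$. With a prior $\pi$ on $\mathcal{A}$, the
quantity of interest $\Phi(f, \mu)$ becomes a random variable and we will be interested in
estimating its distribution conditioned on the observation $D \in B$, where $B \in \mathcal{B}(\mathcal{D})$.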

Example 3.3. In the context of Example 3.1, we are interested in estimating the
probability (under the prior $\pi$) that the system is unsafe, conditioned on the observations
$D \in B$, i.e. the conditional expectation

$$(\pi \odot \mathbb{D})\big[\, \mu[f \geq a] > \epsilon \,\big|\, D \in B \,\big].$$

If $D$ corresponds to observing independent realizations of $(X, f(X))$, then the
observation space $\mathcal{D}$ is $(\mathcal{X} \times \mathbb{R})^n$ and the measure $\mathbb{D}(f, \mu)$ is the one associated with the
random variable $D = ((X^1, f(X^1)), \ldots, (X^n, f(X^n)))$ where the $X^i$ are independent
and distributed according to $\mu$.

If $D$ is the random variable that results from observing $n$ independent realizations
of $(X, f(X) + \xi)$ ($f$ is observed with additive Gaussian noise $\xi \sim \mathcal{N}(0, \sigma^2)$), then the
measure $\mathbb{D}(f, \mu)$ is the one associated with the random variable
$D = ((X^1, f(X^1) + \xi^1), \ldots, (X^n, f(X^n) + \xi^n))$ where the $X^i$ are independent and distributed
according to $\mu$ and the $\xi^i$ are independent Gaussian random variables of mean zero and
variance $\sigma^2$.
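As a concrete reading of the two data maps of Example 3.3, the following sketch draws one realization of the observation random variable, with or without additive Gaussian noise. It is a minimal illustration under stated assumptions: the response function and the input sampler are passed in as ordinary Python callables, and the names `sample_data`, `sample_x` and `sigma` are hypothetical and not taken from the paper.

```python
import random


def sample_data(f, sample_x, n, sigma=0.0, rng=None):
    """Draw one realization of D = ((X^1, f(X^1) + xi^1), ..., (X^n, f(X^n) + xi^n)).

    f        : response function (a callable on inputs)
    sample_x : callable taking a random.Random and returning one draw of X from mu
    n        : number of independent realizations
    sigma    : standard deviation of the Gaussian noise; 0.0 gives noiseless data
    """
    rng = rng or random.Random()
    data = []
    for _ in range(n):
        x = sample_x(rng)                                   # X^i distributed according to mu
        noise = rng.gauss(0.0, sigma) if sigma > 0 else 0.0  # xi^i ~ N(0, sigma^2)
        data.append((x, f(x) + noise))
    return data


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical scenario: f(x) = x^2 and mu uniform on [0, 1].
    print(sample_data(lambda x: x * x, lambda r: r.random(), 3, sigma=0.1, rng=rng))
```

The distribution of the returned list is what the data map assigns to the scenario; the sketch only samples from it and does not represent the measure itself.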

3.4 Bayes' Theorem and conditional expectation 

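Henceforth $\mathcal{A}$ will be a Suslin space, and suppose now that we have $\pi \odot \mathbb{D} \in \mathcal{M}(\mathcal{A} \times \mathcal{D})$
constructed in the above way. Let $\pi \cdot \mathbb{D}$ denote the corresponding Bayes' sampling
distribution defined by the $\mathcal{D}$-marginal of $\pi \odot \mathbb{D}$, and note that, by (3.4), we have

$$\pi \cdot \mathbb{D}[B] := \mathbb{E}_{(f,\mu) \sim \pi}\big[\mathbb{D}(f, \mu)[B]\big], \quad \text{for } B \in \mathcal{B}(\mathcal{D}). \qquad (3.5)$$

Since both $\mathcal{D}$ and $\mathcal{A}$ are Suslin it follows that the product $\mathcal{A} \times \mathcal{D}$ is Suslin.
Consequently, [27, Cor. 10.4.6] asserts that regular conditional probabilities exist for any
sub-$\sigma$-algebra of $\mathcal{B}(\mathcal{A} \times \mathcal{D})$. In particular, the product theorem of [27, Thm. 10.4.11]
asserts that product regular conditional probabilities

$$(\pi \odot \mathbb{D})|d \in \mathcal{M}(\mathcal{A}), \quad \text{for } d \in \mathcal{D},$$

exist and that they are $\pi \cdot \mathbb{D}$-a.e. unique.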

When we consider $\pi \in \mathcal{M}(\mathcal{A})$ a prior, this result can be interpreted as the
posteriors of Bayes' Theorem. However, because such regular conditional probabilities
are only uniquely defined $\pi \cdot \mathbb{D}$-a.e., when a data sample $d \in \mathcal{D}$ arrives such that
$\pi \cdot \mathbb{D}[\{d\}] = 0$, a posterior $(\pi \odot \mathbb{D})|d$ that could be any of the $\pi \cdot \mathbb{D}$-a.e.-equal regular
conditional probabilities evaluated at $d$ appears to have dubious utility. Indeed, the fact
that the regular conditional probabilities are only uniquely defined $\pi \cdot \mathbb{D}$-a.e. suggests
that integrals of posteriors over subsets $B \in \mathcal{B}(\mathcal{D})$ such that $\pi \cdot \mathbb{D}[B] > 0$ are the more
natural objects. Moreover, the restriction that $B$ be an open set is natural for practical
reasons, since conditioning on $D$ lying in an open subset $B$ rather than on its exact
value is what one has to do when the sample data can only be observed after rounding
error. Furthermore, we will show in Section 5 that if the data $d$ have been sampled from
a probability measure $\pi^\dagger \cdot \mathbb{D}$ for some $\pi^\dagger \in \mathcal{M}(\mathcal{A})$ (commonly called a "true prior" in
Bayesian statistics) then with $\pi^\dagger \cdot \mathbb{D}$-probability one (on the realization of $d$), the
$\pi^\dagger \cdot \mathbb{D}$-measure of any open set containing $d$ is strictly positive. In other words,
$\pi^\dagger \cdot \mathbb{D}$-almost surely, $\pi^\dagger$ (the "true prior") belongs to the random subset of $\mathcal{M}(\mathcal{A})$ defined as
the priors $\pi \in \mathcal{M}(\mathcal{A})$ such that $\pi \cdot \mathbb{D}[B] > 0$ for any open set $B$ containing the data $d$
(this subset is randomized through the realization of the data $d$).
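For a prior supported on finitely many scenarios, conditioning on an event $B$ of positive probability reduces to a ratio of two sums built from (3.4) and (3.5). The sketch below illustrates that special case only; it assumes the likelihoods $\mathbb{D}(f_i, \mu_i)[B]$ are supplied as numbers, and the function and argument names are hypothetical.

```python
def posterior_expectation(prior, phi, prob_B):
    """E[Phi | D in B] for a finitely supported prior.

    prior[i]  : prior weight pi({(f_i, mu_i)})
    phi[i]    : value Phi(f_i, mu_i) of the quantity of interest
    prob_B[i] : D(f_i, mu_i)[B], probability of observing data in B under scenario i
    """
    # (pi . D)[B], the Bayes sampling probability of B, as in (3.5)
    evidence = sum(p * l for p, l in zip(prior, prob_B))
    if evidence <= 0:
        raise ValueError("conditioning event B has zero probability under pi . D")
    # Integral of Phi against (pi (.) D)[ . x B], built from (3.4)
    joint = sum(p * v * l for p, v, l in zip(prior, phi, prob_B))
    return joint / evidence


if __name__ == "__main__":
    # Two hypothetical scenarios with equal prior weight.
    print(posterior_expectation([0.5, 0.5], [0.0, 1.0], [0.1, 0.3]))
```

The explicit check on the evidence mirrors the requirement in the text that the conditioning set have strictly positive $\pi \cdot \mathbb{D}$-measure.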

Finally, throughout, we will find it useful to assume that

$$\Phi \text{ is semibounded} \qquad (3.6)$$

in that it is either bounded above or bounded below. Semiboundedness is sufficient to
ensure that the integral of $\Phi$ with respect to any probability measure exists, possibly
with the value $\infty$ or $-\infty$, and such integrands are sufficient for the reduction theorems
of Winkler [121] that we use.

Remark 3.4. Note that the assumption that $\Phi$ is semibounded is mostly for convenience
since integrands which are not semibounded, like that defining the first moment, can be
considered by restricting the space of measures to those measures that have well defined
first moments.

3.5 Incompletely specified priors and observation maps 

In practical situations, the observation map $\mathbb{D}$ may be imperfectly known. For example:

1. the sample data may be corrupted with experimental noise with unknown
distribution;

2. the observations of $(X, f^\dagger(X))$ may not be independent and the available
information on their correlations may be limited to that contained in a covariance matrix;

3. the sample data may also not correspond to direct observations of $(X, f^\dagger(X))$ under
the measure $\mu^\dagger$ but to an observation of random variables correlated through an
unknown process, possibly involving an inverse problem (that may be ill-posed).

Let us also observe that: (1) the choice of a particular prior on $\mathcal{A}$ involves a degree of
arbitrariness that may be incompatible with the certification of rare/critical events; and (2)
the definition of such a prior is a non-trivial task if $\mathcal{A}$ is infinite dimensional. For these
reasons it is necessary to consider situations in which the prior $\pi$ and the observation
map $\mathbb{D}$ are imperfectly known or specified. More precisely, the (lack of) information (or
specification) on $\pi$ and $\mathbb{D}$ can be represented via the introduction of two spaces $\Pi$ and $\mathfrak{D}$,
where the subset $\Pi \subseteq \mathcal{M}(\mathcal{A})$ consists of the set of admissible priors $\pi$ and $\mathfrak{D}$ is a subset
of the set of all measurable mappings $\mathbb{D}$ from $\mathcal{A}$ into $\mathcal{M}(\mathcal{D})$.

One of our goals in allowing incompletely specified priors is to assess the robustness
of posterior Bayesian estimates with respect to the particular choice of priors. More
precisely we will compute optimal bounds on $\mathbb{E}_\pi[\Phi]$ when $\pi \in \Pi$ and show how these
bounds are affected by the introduction of sample data by computing optimal bounds
on $\mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B]$, for $B \in \mathcal{B}(\mathcal{D})$.

4 Optimal bounds on the prior value 

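Recall that for a subset $\mathcal{A}$ and a measurable quantity of interest $\Phi\colon \mathcal{A} \to \mathbb{R}$, under
the assumption $(f^\dagger, \mu^\dagger) \in \mathcal{A}$, we have the optimal upper $\mathcal{U}(\mathcal{A})$ and lower $\mathcal{L}(\mathcal{A})$ bounds
on the value $\Phi(f^\dagger, \mu^\dagger)$ of the quantity of interest, defined in (3.1) and (3.2) by

$$\mathcal{U}(\mathcal{A}) := \sup_{(f,\mu) \in \mathcal{A}} \Phi(f, \mu)$$

$$\mathcal{L}(\mathcal{A}) := \inf_{(f,\mu) \in \mathcal{A}} \Phi(f, \mu).$$

When we put a prior $\pi$ on $\mathcal{A}$, we have to define the value $\bar{\Phi}(\pi)$ of the prior $\pi$
corresponding to an extended quantity of interest $\bar{\Phi}\colon \mathcal{M}(\mathcal{A}) \to \mathbb{R}$ corresponding to $\Phi$.
Disregarding integrability concerns, for a given $\Phi$, let us call the induced function

$$\bar{\Phi}(\pi) := \mathbb{E}_\pi[\Phi], \quad \pi \in \mathcal{M}(\mathcal{A}), \qquad (4.1)$$

the canonical one associated with $\Phi$ and abuse notation by denoting the function $\bar{\Phi}$ as
$\Phi$. For such a canonical quantity of interest, we call the value $\mathbb{E}_\pi[\Phi]$ the prior value, and
note that the values

$$\mathcal{U}(\Pi) := \sup_{\pi \in \Pi} \mathbb{E}_\pi[\Phi] \qquad (4.2)$$

$$\mathcal{L}(\Pi) := \inf_{\pi \in \Pi} \mathbb{E}_\pi[\Phi] \qquad (4.3)$$

form a natural generalization of the values $\mathcal{U}(\mathcal{A})$ and $\mathcal{L}(\mathcal{A})$. Moreover, in the same
way that $\mathcal{U}(\mathcal{A})$ and $\mathcal{L}(\mathcal{A})$ are optimal upper and lower bounds on $\Phi(f^\dagger, \mu^\dagger)$ given the
information that $(f^\dagger, \mu^\dagger) \in \mathcal{A}$, $\mathcal{U}(\Pi)$ and $\mathcal{L}(\Pi)$ are optimal upper and lower bounds on
$\mathbb{E}_\pi[\Phi]$ given the information that $\pi \in \Pi$. Of course, to be well defined, integrability
concerns should be addressed. Indeed, Assumption (3.6) that $\Phi$ is semibounded implies
that $\mathbb{E}_\pi[\Phi]$ is well defined for any bounded measure $\pi$, possibly with the value $\infty$ or
$-\infty$, and therefore the quantities in (4.2) and (4.3) are well defined.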

Remark 4.1. The restriction that the extended quantity of interest corresponding
to $\Phi$ be canonical is really no restriction, but is assumed only to simplify the
presentation and notation. Indeed, there are many important extended quantities of interest
that are not affine as functions of the measure $\pi$. However, all the ones that we have
thought of can be handled by small modifications of the present framework, and their
inclusion here would simply complicate the presentation and notation. Moreover, note
that many affine non-canonical extended quantities of interest become canonical through
simple transformations. For example, when $\Phi_1(f, \mu) := \mu[f \geq a]$ is a quantity of
interest, and the extended quantity of interest is the probability that the system is unsafe,
i.e. $\pi(\{(f, \mu) \mid \mu[f \geq a] > \epsilon\})$ where $\{(f, \mu) \mid \mu[f \geq a] > \epsilon\}$ is the set of unsafe $(f, \mu)$,
then this extended quantity of interest is not canonical with respect to $\Phi_1$. However, by
transformation to $\Phi_2 := \mathbb{1}_{\{r \mid r > \epsilon\}} \circ \Phi_1$, the extended quantity of interest becomes
canonical and $\mathcal{U}(\Pi)$ and $\mathcal{L}(\Pi)$, defined in terms of $\Phi_2$, are optimal upper and lower bounds on
the probability that the system is unsafe given the set of priors $\Pi$.

4.1 General information barriers on prior values 

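Let $\delta\colon \mathcal{A} \to \mathcal{M}(\mathcal{A})$ be the mapping of points to unit Dirac measures, where we denote
$\delta_{(f,\mu)}$ as the Dirac mass at $(f, \mu)$, and, for $\Pi \subseteq \mathcal{M}(\mathcal{A})$, define

$$\mathcal{A}_\Pi := \delta^{-1}\Pi = \{(f, \mu) \in \mathcal{A} \mid \delta_{(f,\mu)} \in \Pi\}. \qquad (4.4)$$

That is, $\mathcal{A}_\Pi$ consists of those scenarios $(f, \mu)$ that are not only admissible in the sense
that they lie in $\mathcal{A}$, but are also admissible as a prior in the sense that $\delta_{(f,\mu)}$ is an element
of $\Pi$.

With the convention that $\mathcal{U}(\emptyset) := -\infty$ and $\mathcal{L}(\emptyset) := +\infty$, the following theorem
shows the relationships among $\mathcal{U}(\mathcal{A})$ and $\mathcal{U}(\mathcal{A}_\Pi)$ as defined by (3.1), $\mathcal{L}(\mathcal{A})$ and $\mathcal{L}(\mathcal{A}_\Pi)$
as defined by (3.2), and $\mathcal{U}(\Pi)$ and $\mathcal{L}(\Pi)$ as defined by (4.2) and (4.3).

Theorem 4.2. It holds true that

$$\mathcal{U}(\mathcal{A}_\Pi) \leq \mathcal{U}(\Pi) \leq \mathcal{U}(\mathcal{A})$$

and

$$\mathcal{L}(\mathcal{A}) \leq \mathcal{L}(\Pi) \leq \mathcal{L}(\mathcal{A}_\Pi).$$

Moreover, if $\mathcal{A}_\Pi$ is non-empty, then

$$\mathcal{L}(\mathcal{A}) \leq \mathcal{L}(\Pi) \leq \mathcal{L}(\mathcal{A}_\Pi) \leq \mathcal{U}(\mathcal{A}_\Pi) \leq \mathcal{U}(\Pi) \leq \mathcal{U}(\mathcal{A}).$$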
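The chain of inequalities in Theorem 4.2 can be checked by hand on a toy problem. The sketch below does this for a hypothetical admissible set with three scenarios and a hypothetical set of three priors, one of which is a Dirac mass; the numbers and helper names are illustrative assumptions, not taken from the paper.

```python
def expectation(prior, phi):
    """E_pi[Phi] for a prior given as a list of weights over a finite scenario set."""
    return sum(w * v for w, v in zip(prior, phi))


def is_dirac(prior):
    """True if the prior puts all of its mass on a single scenario."""
    return sorted(prior) == [0] * (len(prior) - 1) + [1]


def bounds(values):
    """Return (lower, upper), with the conventions L(empty) = +inf and U(empty) = -inf."""
    values = list(values)
    if not values:
        return float("inf"), float("-inf")
    return min(values), max(values)


# Hypothetical data: Phi on three scenarios and a set Pi of three admissible priors.
phi = [0.1, 0.4, 0.9]
priors = [[0.0, 1.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]

L_A, U_A = bounds(phi)                                    # bounds over the admissible set
L_Pi, U_Pi = bounds(expectation(p, phi) for p in priors)  # bounds over the set of priors
A_Pi = [p.index(1.0) for p in priors if is_dirac(p)]      # scenarios whose Dirac mass lies in Pi
L_APi, U_APi = bounds(phi[i] for i in A_Pi)

if __name__ == "__main__":
    print(L_A, L_Pi, L_APi, U_APi, U_Pi, U_A)
```

Removing the Dirac prior from the list empties the set of admissible scenarios, and the stated conventions then keep the two outer pairs of inequalities valid.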

4.2 Priors specified through marginals 

In many settings, probability measures or sets of probability measures are specified
through generalized moments or other properties of marginal distributions. To analyse
this case, let $\mathcal{Q}$ be a topological space and consider a measurable map $\Psi\colon \mathcal{A} \to \mathcal{Q}$.
Let us abuse notation by also denoting the corresponding pushforward of measures
$\Psi\colon \mathcal{M}(\mathcal{A}) \to \mathcal{M}(\mathcal{Q})$ by the same symbol $\Psi$. For a probability measure $\mathbb{Q} \in \mathcal{M}(\mathcal{Q})$, let

$$\Psi^{-1}\mathbb{Q} := \{\pi \in \mathcal{M}(\mathcal{A}) \mid \Psi\pi = \mathbb{Q}\}$$

be the set of probability measures $\pi \in \mathcal{M}(\mathcal{A})$ that push forward to $\mathbb{Q}$. More generally,
for a non-empty set $\mathfrak{Q} \subseteq \mathcal{M}(\mathcal{Q})$, let

$$\Psi^{-1}\mathfrak{Q} := \{\pi \in \mathcal{M}(\mathcal{A}) \mid \Psi\pi \in \mathfrak{Q}\} \qquad (4.5)$$

be the set of probability measures $\pi \in \mathcal{M}(\mathcal{A})$ such that $\Psi\pi \in \mathfrak{Q}$. Now, let $\mathfrak{Q} \subseteq \mathcal{M}(\mathcal{Q})$
be an admissible set of $\Psi$-marginals. Then the corresponding admissible set of priors
is $\Psi^{-1}\mathfrak{Q} \subseteq \mathcal{M}(\mathcal{A})$ and the corresponding objects to be computed are $\mathcal{U}(\Psi^{-1}\mathfrak{Q})$ and
$\mathcal{L}(\Psi^{-1}\mathfrak{Q})$ according to (4.2) and (4.3).

We will now demonstrate how to reduce the computation of $\mathcal{U}(\Psi^{-1}\mathfrak{Q})$ and $\mathcal{L}(\Psi^{-1}\mathfrak{Q})$
when $\mathfrak{Q}$ is specified by linear inequalities. Later, in Section 4.2.2, we will develop a more
powerful nested reduction which will provide the foundation for our reduction methods.

Before we begin, we need to introduce some terminology. Following Winkler [121], let
$\mathcal{Y}$ be a topological space and let $M \subseteq \mathcal{M}(\mathcal{Y})$ be a set of measures. Let $\mathrm{ext}(M)$ denote
the set of extreme points of $M$ and let the evaluation field $\Sigma(\mathrm{ext}(M))$ be the smallest
$\sigma$-algebra of subsets of $\mathrm{ext}(M)$ such that the evaluation map $\nu \mapsto \nu(B)$ is measurable
for all $B \in \mathcal{B}(\mathcal{Y})$. Then a measure $\nu \in \mathcal{M}(\mathcal{Y})$ is said to be a barycenter of $M$ if there
exists a probability measure $p$ on $\Sigma(\mathrm{ext}(M))$ such that the barycentric formula

$$\nu(B) = \int_{\mathrm{ext}(M)} \nu'(B)\, dp(\nu'), \quad B \in \mathcal{B}(\mathcal{Y}) \qquad (4.6)$$

holds. Furthermore, the following notion of a measure affine function is central to
Winkler's [121] reduction theorems, which we use:

Definition 4.3. An extended real-valued function $F$ on $M \subseteq \mathcal{M}(\mathcal{Y})$ is said to be
measure affine if, for all $\nu \in M$ and all probability measures $p$ on $\Sigma(\mathrm{ext}(M))$ for which
the barycentric formula (4.6) holds, $F$ is $p$-integrable and

$$F(\nu) = \int_{\mathrm{ext}(M)} F(\nu')\, dp(\nu').$$

A major consequence of the assumption (3.6), that $\Phi$ is semibounded, is that $\mathbb{E}_\nu[\Phi]$
exists, with possible values $\infty$ and $-\infty$, for all finite measures $\nu$. As a consequence, by
[121, Prop. 3.1], the extended-real-valued function

$$\nu \mapsto \mathbb{E}_\nu[\Phi]$$

is measure affine.

4.2.1 Primary reduction for prior values 

Let us consider the computation of

$$\mathcal{U}(\Psi^{-1}\mathfrak{Q}) = \sup_{\pi \in \Psi^{-1}\mathfrak{Q}} \mathbb{E}_\pi[\Phi] \qquad (4.7)$$

when $\mathfrak{Q}$ is specified by $n$ generalized moment inequalities determined by measurable
functions $g_i$, $i = 1, \ldots, n$. The situation for the lower bound $\mathcal{L}(\Psi^{-1}\mathfrak{Q})$ is the same. That
is, let $I_i$, $i = 1, \ldots, n$, be $n$ closed intervals, allowing semi-infinite intervals $(-\infty, q_i]$ and
$[q_i, \infty)$, and define

$$\mathfrak{Q} = \{\mathbb{Q} \in \mathcal{M}(\mathcal{Q}) \mid \mathbb{E}_{\mathbb{Q}}[g_i] \in I_i \text{ for } i = 1, \ldots, n\},$$

where implicit in the definition is that all $n$ integrals exist. Then, by a change of
variables, $\mathbb{E}_{\Psi\pi}[g_i] = \mathbb{E}_\pi[g_i \circ \Psi]$ holds if either integral exists (see e.g. [11 , Cor. 19.2]), so
we conclude that

$$\begin{aligned}
\Psi^{-1}\mathfrak{Q} &:= \{\pi \in \mathcal{M}(\mathcal{A}) \mid \Psi\pi \in \mathfrak{Q}\} \\
&= \{\pi \in \mathcal{M}(\mathcal{A}) \mid \mathbb{E}_{\Psi\pi}[g_i] \in I_i \text{ for } i = 1, \ldots, n\} \\
&= \{\pi \in \mathcal{M}(\mathcal{A}) \mid \mathbb{E}_\pi[g_i \circ \Psi] \in I_i \text{ for } i = 1, \ldots, n\}
\end{aligned}$$

and so conclude that $\Psi^{-1}\mathfrak{Q}$ is defined by the $n$ generalized moment inequalities
corresponding to $g_i \circ \Psi\colon \mathcal{A} \to \mathbb{R}$ for $i = 1, \ldots, n$. Consequently, since the function $\pi \mapsto \mathbb{E}_\pi[\Phi]$
is measure affine, it follows from the reduction theorems of [86] that we can reduce the
supremum on the right-hand side of (4.7) to a supremum over convex combinations of
$n + 1$ Dirac masses. To state the theorem we have just proven, let

$$\Delta(n) := \Big\{ \sum_{i=0}^{n} \alpha_i \delta_{(f_i,\mu_i)} \Bigm| (f_i, \mu_i) \in \mathcal{A},\ \alpha_i \geq 0, \text{ for } i = 0, \ldots, n \Big\} \qquad (4.8)$$

be the set of non-negative combinations of $n + 1$ Dirac masses. Let the vector $I$ of
intervals have components $I_i$ for $i = 1, \ldots, n$, let

$$\Pi(I) := \Psi^{-1}\mathfrak{Q}$$

be defined as above, and consider the subset

$$\Pi(I, n) := \Pi(I) \cap \Delta(n) \subseteq \Pi(I) \qquad (4.9)$$

of those measures which are $(n + 1)$-fold convex combinations of Dirac masses.

Theorem 4.4. Let $\mathcal{A}$ be Suslin, let $\mathcal{Q}$ be separable and metrizable, and let $\Psi\colon \mathcal{A} \to \mathcal{Q}$
be measurable. Moreover, for $n$ measurable functions $g_1, \ldots, g_n\colon \mathcal{Q} \to \mathbb{R}$ and $n$ closed
intervals $I_1, \ldots, I_n$, let

$$\mathfrak{Q} := \{\mathbb{Q} \in \mathcal{M}(\mathcal{Q}) \mid \mathbb{E}_{\mathbb{Q}}[g_i] \in I_i \text{ for } i = 1, \ldots, n\}$$

define the admissible set of $\Psi$-marginals. Then,

$$\mathcal{U}(\Pi(I)) = \mathcal{U}(\Pi(I, n))$$

where

$$\mathcal{U}(\Pi(I, n)) = \begin{cases}
\sup \sum_{i=0}^{n} \alpha_i \Phi(f_i, \mu_i) \\
\text{among } (f_i, \mu_i) \in \mathcal{A},\ \alpha_i \geq 0,\ \sum_{i=0}^{n} \alpha_i = 1 \\
\text{such that } \sum_{i=0}^{n} \alpha_i g_j(\Psi(f_i, \mu_i)) \in I_j \text{ for } j = 1, \ldots, n.
\end{cases} \qquad (4.10)$$

Remark 4.5. The freedom to determine intervals $I_i$, $i = 1, \ldots, n$, is one way to
incorporate uncertainty and maintain a reduction to $n + 1$ Dirac masses. In particular,
by choosing semi-infinite intervals $I_i := (-\infty, q_i]$ we obtain a reduction to $n + 1$ Dirac
masses for inequality constraints of the form $\mathbb{E}_{\mathbb{Q}}[g_i] \leq q_i$, and by choosing point intervals
$I_i := [q_i, q_i]$ we obtain a reduction to $n + 1$ Dirac masses for equality constraints of the
form $\mathbb{E}_{\mathbb{Q}}[g_i] = q_i$. Moreover, by choosing the interval to be semi-infinite or point interval
depending on the index $i$ we obtain a reduction to $n + 1$ Dirac masses for mixed equality
and inequality constraints.

Theorem 4.4 can be put into a canonical form in the following way: by considering
the modified feature map $\Psi'\colon \mathcal{A} \to \mathbb{R}^n$ with components

$$\Psi'_i := g_i \circ \Psi, \quad \text{for } i = 1, \ldots, n,$$

it follows from the above that

$$\Psi^{-1}\mathfrak{Q} = \{\pi \in \mathcal{M}(\mathcal{A}) \mid \mathbb{E}_\pi[\Psi'] \in I\}.$$

That is, by changing from the feature map $\Psi$ to $\Psi'$ we end up with a constraint set defined
by the first moment of the vector function $\Psi'$. Therefore, let us remove the $'$ from $\Psi'$,
and require $\Psi\colon \mathcal{A} \to \mathbb{R}^n$ to be measurable. The following theorem is the canonical form
of Theorem 4.4. It is a corollary of Theorem 4.4 for the constraint $\mathbb{E}_\pi[\Psi] \in Z$ when
$Z = I$ is a closed rectangle. However, it is true for arbitrary $Z \subseteq \mathbb{R}^n$.

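Theorem 4.6. Let $\mathcal{A}$ be Suslin, let $\Psi\colon \mathcal{A} \to \mathbb{R}^n$ be measurable, let $Z \subseteq \mathbb{R}^n$, and let

$$\mathfrak{Q} := \{\mathbb{Q} \in \mathcal{M}(\mathbb{R}^n) \mid \mathbb{E}_{q \sim \mathbb{Q}}[q] \in Z\} \qquad (4.11)$$

be the set of those measures whose first moment belongs to $Z$. Then, for

$$\Pi(Z) := \Psi^{-1}\mathfrak{Q} = \{\pi \in \mathcal{M}(\mathcal{A}) \mid \mathbb{E}_\pi[\Psi] \in Z\} \qquad (4.12)$$

and $\Pi(Z, n) := \Pi(Z) \cap \Delta(n)$, we have

$$\mathcal{U}(\Pi(Z)) = \mathcal{U}(\Pi(Z, n))$$

where

$$\mathcal{U}(\Pi(Z, n)) = \begin{cases}
\sup \sum_{i=0}^{n} \alpha_i \Phi(f_i, \mu_i) \\
\text{among } (f_i, \mu_i) \in \mathcal{A},\ \alpha_i \geq 0,\ \sum_{i=0}^{n} \alpha_i = 1 \\
\text{such that } \sum_{i=0}^{n} \alpha_i \Psi(f_i, \mu_i) \in Z.
\end{cases} \qquad (4.13)$$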



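Example 4.7. Let $\mathcal{X} := [0, 1]$, $\mathcal{Q} = \mathbb{R}$ and consider the admissible set $\mathcal{A} := \mathcal{M}([0, 1])$,
the quantity of interest $\Phi(\mu) := \mu[X \geq a]$ for some $a \in (0, 1)$, and the map $\Psi\colon \mathcal{A} \to \mathbb{R}$
defined by $\Psi(\mu) := \mathbb{E}_\mu[X]$. Take as the set of admissible priors $\pi$ on $\mathcal{A}$ the collection

$$\Pi := \{\pi \in \mathcal{M}(\mathcal{A}) \mid \mathbb{E}_{\mu \sim \pi}[\mathbb{E}_\mu[X]] = q\}$$

for some fixed $q \in (0, a)$. Then we will show that

$$\mathcal{U}(\Pi) = q/a. \qquad (4.14)$$

To that end, observe that since $\mathbb{E}_{\mu \sim \pi}[\mathbb{E}_\mu[X]] = \mathbb{E}_\pi[\Psi]$, it follows that

$$\Pi = \{\pi \in \mathcal{M}(\mathcal{A}) \mid \mathbb{E}_\pi[\Psi] = q\},$$

so that Theorem 4.6 implies that we can reduce the optimization in $\mathcal{U}(\Pi)$ to the
supremum over $\mu_1, \mu_2 \in \mathcal{A}$, $\alpha \in [0, 1]$ of

$$\alpha \mu_1[X \geq a] + (1 - \alpha) \mu_2[X \geq a]$$

subject to the constraint

$$\alpha \mathbb{E}_{\mu_1}[X] + (1 - \alpha) \mathbb{E}_{\mu_2}[X] = q.$$

Introducing the slack variables $q_1 := \mathbb{E}_{\mu_1}[X]$, $q_2 := \mathbb{E}_{\mu_2}[X]$ and using [86, Thm. 4.1] to
reduce this problem further in $\mu_1, \mu_2$, we obtain that $\mathcal{U}(\Pi)$ is equal to the supremum
over $\alpha \in [0, 1]$ and $q_1, q_2 \in [0, 1]$ of

$$\alpha \min\{1, \tfrac{q_1}{a}\} + (1 - \alpha) \min\{1, \tfrac{q_2}{a}\}$$

subject to the constraint $\alpha q_1 + (1 - \alpha) q_2 = q$. Observing that the supremum is achieved
at $q_1, q_2 \leq a$, we conclude that $\mathcal{U}(\Pi) = q/a$, establishing (4.14). Moreover, note that
$\mathcal{U}(\Pi) = \mathcal{U}(\mathcal{A}_\Pi)$ for $\mathcal{A}_\Pi$ defined in (4.4) instead of the general inequality $\mathcal{U}(\mathcal{A}_\Pi) \leq \mathcal{U}(\Pi)$
of Theorem 4.2.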
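The reduced two-variable problem at the end of Example 4.7 is small enough to check numerically. The sketch below performs a crude grid search over the mixture weight and the first slack variable, solving the mean constraint for the second slack variable; the grid resolution and the function names are assumptions made for illustration, and the search is a sanity check rather than a proof.

```python
def reduced_objective(alpha, q1, q2, a):
    """alpha * min(1, q1/a) + (1 - alpha) * min(1, q2/a), the reduced objective."""
    return alpha * min(1.0, q1 / a) + (1.0 - alpha) * min(1.0, q2 / a)


def grid_sup(q, a, steps=200):
    """Approximate the supremum subject to alpha*q1 + (1 - alpha)*q2 = q on a grid."""
    best = 0.0
    for i in range(1, steps):                  # alpha strictly inside (0, 1)
        alpha = i / steps
        for j in range(steps + 1):             # q1 on a grid over [0, 1]
            q1 = j / steps
            q2 = (q - alpha * q1) / (1.0 - alpha)   # enforce the mean constraint
            if 0.0 <= q2 <= 1.0:               # keep only feasible slack values
                best = max(best, reduced_objective(alpha, q1, q2, a))
    return best


if __name__ == "__main__":
    q, a = 0.2, 0.5
    print(grid_sup(q, a), q / a)
```

Because the map $t \mapsto \min\{1, t/a\}$ is concave, no feasible point can exceed $\min\{1, q/a\} = q/a$, so the grid value can only approach the closed form from below (up to rounding).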

4.2.2 Nested reduction for prior values 

The result of Example 4.7 can also be deduced through a nested reduction that we
will find generally more useful for two reasons. The first is that, in practice, not only
is it highly non-trivial to specify a prior on the space $\mathcal{A}$, since it requires quantifying
information on an infinite-dimensional space, but it may also be undesirable to do so.
Indeed, if an expert does not have a prior on the full space $\mathcal{A}$ but only on some projection
$\Psi(\mathcal{A}) = \mathcal{Q}$, then, rather than arbitrarily picking one particular prior on the space $\mathcal{A}$
compatible with the specified prior on $\Psi(\mathcal{A})$, it might be preferable to work with the
set of priors on $\mathcal{A}$ specified through such marginals. Our second and main motivation
is that, even when we can do the reduction on the primary space $\mathcal{M}(\mathcal{A})$, the reduced
space remains so large that it may not be amenable to computation. However, with
the nested reduction theorems given below, the reduced space becomes computationally
manageable when $\mathcal{Q}$ is finite dimensional.

Example 4.8. A simple example is $\Phi(f, \mu) := \mu[f \geq a]$ (where $a$ is a safety margin),
$\Psi(f, \mu) = (\mathbb{E}_\mu[f], \mathrm{Var}_\mu[f])$, $\mathcal{Q} = \mathbb{R}^2$, $\mathfrak{Q} = \{\mathbb{Q}\}$ where $\mathbb{Q}$ corresponds to the uniform
distribution on $[-1, 1] \times [3, 4]$. In that example, the expert has only "the prior" that the
mean of $f$ with respect to $\mu$ is uniformly distributed on $[-1, 1]$ and that the variance
of $f$ with respect to $\mu$ is independent of its mean and uniformly distributed on $[3, 4]$.
Observe that in this situation $\mathbb{Q}$ does not uniquely specify a prior $\pi \in \mathcal{M}(\mathcal{A})$ but an
infinite-dimensional set of priors $\Psi^{-1}(\mathfrak{Q}) \subseteq \mathcal{M}(\mathcal{A})$ and a robust approach would require
assessing the safety of the system under the whole set $\Psi^{-1}(\mathfrak{Q})$ rather than under a
particular element $\pi$ of that set.

Idea of the nested reduction. Roughly, the idea of the nested reduction is as follows. 
To compute (4.7), consider the induced function 

defined by 

{Uo^i>-^){q):=ll{^-Hq))= sup $(/,/z), for q e Q, 

{/,M)e*-i(q) 

where we use the notation of (3.1). From this it is natural to consider 

EQ[^/o'f-i], for Q GO. 

Let Q G 0. Then, for any vr such that ^vr = Q, it follows that 

Eq[ZYo^-1] =E^^[^/o^-1] 

= E^[^o^-lo^] 

Unfortunately, it is not true that U o ^-^ o^ = ^- instead it is [U o xjf-i o ^){f,fj,) = 
sup(j/^^/v^(j/^^/w,j,(j^^) ^{f ,fj,'). However, if it were true, then we would obtain 

= E^IU o ^-1 o *] 
= E^[$] 



sup Eq[U o ^-1] = sup E^[$] = ZY(^-^£3). 



and conclude that 



We will show that, despite the fact that lA o^ ^ o^i ^ (^^ the conclusion 

U{-^-^0.) = sup Eq[U o ^-1] (4.15) 

is still valid, provided that it is interpreted correctly. Heuristically, the reason for this 
is that the supremum sup^g^-i^j in Z//(^~^0) is exploring the maximum value of $ on 
level sets of ^ very much like the supremum in {U o ^~^)(g) = sup^-i/^) <1>. 

If $\mathcal{A}$ is such that a reduction theorem, e.g. from [86], applies to reduce the
computation of the inner supremum in $\mathcal{U} \circ \Psi^{-1}$ to the supremum over convex combinations
of Dirac masses, and the admissible set $\mathfrak{Q}$ is such that a reduction theorem applies to
the computation of the outer supremum $\sup_{\mathbb{Q} \in \mathfrak{Q}} \mathbb{E}_{\mathbb{Q}}[\mathcal{U} \circ \Psi^{-1}]$, then the identity (4.15)
represents a nesting of reductions.

Let us now establish a result like (4.15). To do so will require addressing three
questions: (1) What kind of function is $\mathcal{U} \circ \Psi^{-1}$? (2) What kind of measures $\mathbb{Q} \in \mathcal{M}(\mathcal{Q})$
can define an integral of a function with properties discovered from the answer to (1)? (3)
Can we obtain a measurable solution operator to the optimization problem $(\mathcal{U} \circ \Psi^{-1})(q)$,
where $q \in \mathcal{Q}$? To that end, let us first recall a definition of universally measurable
functions.

Definition 4.9. Let $(T, \mathscr{T})$ be a measurable space, and for a positive measure $\nu$ on
$(T, \mathscr{T})$, let $\mathscr{T}_\nu$ denote the $\nu$-completion of $\mathscr{T}$. Let $\hat{\mathscr{T}} := \bigcap_\nu \mathscr{T}_\nu$, where the intersection
is over all positive bounded measures $\nu$, denote the universally measurable sets. A
$\hat{\mathscr{T}}$-measurable function is said to be universally measurable.

At the heart of the commutative representation used for the nested reduction is the 
following optimal measurable selection lemma answering questions (1) and (3) above: 

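Lemma 4.10. Let $\mathcal{A}$ be a Suslin space, let $\mathcal{Q}$ be a separable and metrizable space, and
let $\Psi \colon \mathcal{A} \to \mathcal{Q}$ be measurable. Then, for any subset $T \subseteq \Psi(\mathcal{A})$,

1. $\mathcal{U} \circ \Psi^{-1}$ is $\hat{\mathcal{B}}(T)$-measurable;

2. for all $\delta > 0$, there exists a $\delta$-optimal $\hat{\mathcal{B}}(T)$-measurable section of $\Psi$; that is, a
$\hat{\mathcal{B}}(T)$-measurable function $\psi \colon T \to \mathcal{A}$ such that $\Psi(\psi(q)) = q$ for all $q \in T$ and

$$\Phi(\psi(q)) > \mathcal{U}(\Psi^{-1}(q)) - \delta, \quad \text{for all } q \in T.$$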

To answer question (2) above, define a support $\operatorname{supp}(\mathbb{Q})$ of a measure $\mathbb{Q} \in \mathcal{M}(\mathcal{Q})$, as
in [4, Ch. 12.3], to be a closed set such that

• $\mathbb{Q}(\mathcal{Q} \setminus \operatorname{supp}(\mathbb{Q})) = 0$, and

• if $G \subseteq \mathcal{Q}$ is open and $G \cap \operatorname{supp}(\mathbb{Q}) \neq \emptyset$, then $\mathbb{Q}(G \cap \operatorname{supp}(\mathbb{Q})) > 0$.

When $\mathcal{Q}$ is a separable and metrizable space, it follows that it is second countable and
therefore, by [4, Thm. 12.14], all $\mathbb{Q} \in \mathcal{M}(\mathcal{Q})$ have a uniquely defined support. Now
consider a measure $\mathbb{Q} \in \mathcal{M}(\mathcal{Q})$ such that $\operatorname{supp}(\mathbb{Q}) \subseteq \Psi(\mathcal{A})$. Then, by Lemma 4.10,
$\mathcal{U} \circ \Psi^{-1}$ is $\hat{\mathcal{B}}(\operatorname{supp} \mathbb{Q})$-measurable. Therefore, the expected value $\mathbb{E}_{\mathbb{Q}}[\mathcal{U} \circ \Psi^{-1}]$ can be
defined by integration with respect to the completion $\hat{\mathbb{Q}}$:

$$\mathbb{E}_{\mathbb{Q}}[\mathcal{U} \circ \Psi^{-1}] := \mathbb{E}_{\hat{\mathbb{Q}}}[\mathcal{U} \circ \Psi^{-1}]. \tag{4.16}$$

More generally, for any universally measurable function $f$ and any finite measure $\mathbb{Q}$, we
define the expected value $\mathbb{E}_{\mathbb{Q}}[f]$ of $f$ by

$$\mathbb{E}_{\mathbb{Q}}[f] := \mathbb{E}_{\hat{\mathbb{Q}}}[f]. \tag{4.17}$$

Defining integrals of universally measurable, but possibly non-Borel measurable, functions
in this way raises several questions: when is the integral uniquely defined? For a fixed
integrand, when is the expectation operation affine in the measure? Does it have a change
of variables formula? All such questions have nice answers and, although we are sure that
this is classical, we cannot find a reference for these facts, so we have included statements
and proofs of the facts needed in this paper in Section 9.1 of the Appendix.

We now state our nested reduction theorem of the form (4.15): 

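Theorem 4.11. Let $\mathcal{A}$ be a Suslin space, let $\mathcal{Q}$ be a separable and metrizable space, and
let $\Psi \colon \mathcal{A} \to \mathcal{Q}$ be measurable. Moreover, let $\mathfrak{Q} \subseteq \mathcal{M}(\mathcal{Q})$ be such that $\operatorname{supp}(\mathbb{Q}) \subseteq \Psi(\mathcal{A})$ for
all $\mathbb{Q} \in \mathfrak{Q}$. Then, for each $\mathbb{Q} \in \mathfrak{Q}$, $\Psi^{-1}\mathbb{Q}$ is non-empty. Moreover, the upper bound
$\mathcal{U}(\Psi^{-1}\mathfrak{Q})$, defined in (4.2), satisfies

$$\mathcal{U}(\Psi^{-1}\mathfrak{Q}) = \sup_{\mathbb{Q} \in \mathfrak{Q}} \mathbb{E}_{\mathbb{Q}}[\mathcal{U} \circ \Psi^{-1}], \tag{4.18}$$

where the expectations on the right-hand side are defined as in (4.16). Finally, the
expectation operator on the right-hand side is measure affine in $\mathbb{Q}$.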

Remark 4.12. Since the right-hand side is measure affine in $\mathbb{Q}$, if $\mathfrak{Q}$ is specified through
(multi-)linear generalized moment inequalities, then the reduction theorems of [86] can
be applied to obtain the supremum over $\mathfrak{Q}$ by reducing $\mathbb{Q}$ to a convex combination of
a finite number of Dirac masses on $\mathcal{Q}$. Moreover, if $\mathfrak{Q}$ consists of a single element,
i.e. $\mathfrak{Q} = \{\mathbb{Q}\}$, then

$$\mathcal{U}(\Psi^{-1}\mathfrak{Q}) = \mathcal{U}(\Psi^{-1}\mathbb{Q}) = \mathbb{E}_{\mathbb{Q}}[\mathcal{U} \circ \Psi^{-1}], \tag{4.19}$$

and the right-hand side of (4.19) can be evaluated via Monte Carlo sampling of $q \in \mathcal{Q}$
according to the measure $\mathbb{Q}$.
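
The Monte Carlo evaluation mentioned in Remark 4.12 can be sketched as follows. This is only an illustration under stated assumptions: the inner supremum is taken to be the closed form $\min\{1, q'/a\}$ of the mean-constrained example, $\mathbb{Q}$ is assumed to be uniform on $[0,1]$, and the names `inner_sup`, `monte_carlo_upper_bound` and the value $a = 0.5$ are hypothetical choices, not part of the paper.

```python
import random

def inner_sup(q, a):
    """Assumed closed form of (U o Psi^{-1})(q) for the mean-constrained example."""
    return min(1.0, q / a)

def monte_carlo_upper_bound(sample_q, inner, n_samples, seed=0):
    """Estimate E_Q[U o Psi^{-1}] by averaging the inner supremum over draws q ~ Q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += inner(sample_q(rng))
    return total / n_samples

A_THRESHOLD = 0.5  # hypothetical safety margin a

estimate = monte_carlo_upper_bound(
    sample_q=lambda rng: rng.random(),          # assumed Q: uniform on [0, 1]
    inner=lambda q: inner_sup(q, A_THRESHOLD),
    n_samples=200_000,
)
```

For this particular choice the integral can be computed by hand, $\int_0^{1/2} 2q\,dq + \tfrac12 = \tfrac34$, which gives a simple sanity check for the estimator.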

Remark 4.13. A similar theorem can be obtained for the optimal lower bound $\mathcal{L}(\Psi^{-1}\mathfrak{Q})$.
Throughout this paper, results given for optimal upper bounds $\mathcal{U}$ can be translated into
results for optimal lower bounds $\mathcal{L}$ by considering the negative quantity of interest $-\Phi$,
and for the sake of concision we will not write those results unless necessary.

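Example 4.14. Consider again Example 4.7, where $\mathcal{X} := [0,1]$, $\mathcal{Q} = \mathbb{R}$, the admissible
set $\mathcal{A} := \mathcal{M}([0,1])$, the quantity of interest $\Phi(\mu) := \mu[X \geq a]$ for some $a \in (0,1)$, the
map $\Psi \colon \mathcal{A} \to \mathbb{R}$ is defined by $\Psi(\mu) := \mathbb{E}_\mu[X]$, and the set of admissible priors $\pi$ on $\mathcal{A}$ is
the collection

$$\Pi := \{\pi \in \mathcal{M}(\mathcal{A}) \mid \mathbb{E}_{\mu \sim \pi}[\mathbb{E}_\mu[X]] = q\}$$

for some fixed $q \in (0,a)$. We will now demonstrate how the result $\mathcal{U}(\Pi) = q/a$ of (4.14)
obtained by the primary reduction follows from the nested reduction theorem. To that
end, observe that since $\Psi(\mathcal{A}) = [0,1] \subseteq \mathbb{R}$, by restricting to measures $\mathbb{Q} \in \mathcal{M}(\mathbb{R})$ with
support $\operatorname{supp}(\mathbb{Q}) \subseteq [0,1]$, Theorem 4.11 implies that

$$\mathcal{U}(\Pi) = \sup_{\mathbb{Q} \in \mathfrak{Q}} \mathbb{E}_{q' \sim \mathbb{Q}}\Big[\sup_{\mu \in \mathcal{M}([0,1]) \,:\, \mathbb{E}_\mu[X] = q'} \mu[X \geq a]\Big], \tag{4.20}$$

where $\mathfrak{Q}$ is the set of probability measures $\mathbb{Q}$ on $\mathbb{R}$ with support contained in $[0,1]$ such
that $\mathbb{E}_{q' \sim \mathbb{Q}}[q'] = q$. Theorem 4.1 of [86] shows that the inner supremum of $\mu[X \geq a]$ can
be achieved by assuming that $\mu$ is the weighted sum of two Dirac masses, i.e.

$$\sup_{\substack{\mu \in \mathcal{M}([0,1]) \\ \mathbb{E}_\mu[X] = q'}} \mu[X \geq a] = \sup_{\substack{\alpha, x_1, x_2 \in [0,1] \\ \alpha x_1 + (1-\alpha) x_2 = q'}} (\alpha \delta_{x_1} + (1-\alpha)\delta_{x_2})[X \geq a]. \tag{4.21}$$

For $q' \geq a$, the supremum in the right-hand side of (4.21) is 1, and for $q' < a$, the
supremum in the right-hand side of (4.21) is achieved by $x_2 = 0$, $x_1 = a$ and $\alpha = q'/a$,
and so we conclude that

$$\sup_{\substack{\mu \in \mathcal{M}([0,1]) \\ \mathbb{E}_\mu[X] = q'}} \mu[X \geq a] = \min\Big\{1, \frac{q'}{a}\Big\}.$$

Hence, by identifying the measures $\mathbb{Q} \in \mathcal{M}(\mathbb{R})$ with support $\operatorname{supp}(\mathbb{Q}) \subseteq [0,1]$ with
$\mathcal{M}([0,1])$ in the obvious way, (4.20) becomes

$$\mathcal{U}(\Pi) = \sup_{\substack{\mathbb{Q} \in \mathcal{M}([0,1]) \\ \mathbb{E}_{q' \sim \mathbb{Q}}[q'] = q}} \mathbb{E}_{q' \sim \mathbb{Q}}\Big[\min\Big\{1, \frac{q'}{a}\Big\}\Big]. \tag{4.22}$$

Using [86, Thm. 4.1] again, we obtain that the supremum in $\mathbb{Q}$ in the right-hand side of
(4.22) is equal to the supremum over $\alpha, q_1, q_2 \in [0,1]$ of

$$\alpha \min\Big\{1, \frac{q_1}{a}\Big\} + (1-\alpha) \min\Big\{1, \frac{q_2}{a}\Big\} \tag{4.23}$$

subject to the constraint that $\alpha q_1 + (1-\alpha) q_2 = q$. This supremum is achieved by $q_1 = a$,
$q_2 = 0$ and $\alpha = q/a$, and so we obtain that $\mathcal{U}(\Pi) = q/a$, in agreement with (4.14).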
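
The two-point optimization (4.23) is small enough to check by brute force. The following sketch enumerates a grid of $(q_1, q_2)$ with $q_2 \leq q \leq q_1$ (no loss of generality by symmetry), solves the mean constraint for $\alpha$, and keeps the best objective value. The grid resolution, the function names and the values $q = 0.2$, $a = 0.5$ are illustrative assumptions.

```python
def g(x, a):
    """Inner supremum min{1, x/a} from the example."""
    return min(1.0, x / a)

def grid_sup(q_num, a_num, grid=100):
    """Brute-force (4.23) on a grid; q = q_num/grid and a = a_num/grid."""
    q = q_num / grid
    a = a_num / grid
    best = g(q, a)                          # degenerate case q1 = q2 = q
    for i1 in range(q_num, grid + 1):       # q1 >= q
        for i2 in range(0, q_num + 1):      # q2 <= q
            if i1 == i2:
                continue
            q1, q2 = i1 / grid, i2 / grid
            alpha = (q - q2) / (q1 - q2)    # enforces alpha*q1 + (1-alpha)*q2 = q
            value = alpha * g(q1, a) + (1 - alpha) * g(q2, a)
            best = max(best, value)
    return best

best = grid_sup(q_num=20, a_num=50)         # q = 0.2, a = 0.5
```

Because $x \mapsto \min\{1, x/a\}$ is concave, Jensen's inequality caps the objective at $\min\{1, q/a\} = q/a$, so the grid search cannot exceed that value and attains it at $q_1 = a$, $q_2 = 0$.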

5 Optimal bounds on the posterior value 

What happens to the optimal bounds (4.2) and (4.3) on the prior value $\mathbb{E}_\pi[\Phi]$, investigated
in Section 4, after conditioning on the data? Does the interval corresponding to
these optimal bounds shrink down to a single point as more and more data comes in? 
Does this interval shrink as the measurement noise on the data is reduced? What hap- 
pens to posterior estimates associated with two distinct but close priors, possibly sharing 
the same marginal distribution on a high dimensional space? These are the questions 
that will be investigated in this section. Our answers will show that: (1) optimal bounds 
on posterior estimates grow as data comes in; (2) optimal bounds on posterior estimates 
grow as measurement noise is reduced; (3) two priors sharing the same high-dimensional
marginals can lead to diametrically opposed posterior estimates. Although the Bayesian 
framework is a standard method for constructing a statistical estimator, the surprising 
answers provided in this section will show that one should be cautious with its direct 
application to UQ. 
As discussed in Section 3.5, let us now consider the case where, for a fixed $(f,\mu) \in \mathcal{A}$,
there is some uncertainty regarding the observation process $\mathbb{D}(f,\mu)$. Instead of
representing this uncertainty by generalizing from $\mathbb{D} \colon \mathcal{A} \to \mathcal{M}(\mathcal{D})$ being a function to
$\mathbb{D}$ being a set-valued map, we have, for simplicity, chosen to express this uncertainty by
specifying a set $\mathfrak{D}$ of mappings from $\mathcal{A}$ to $\mathcal{M}(\mathcal{D})$ and to express our information regarding
$\mathbb{D}$ through the assumption $\mathbb{D} \in \mathfrak{D}$. We can easily generalize the notation of Section 3.3
to this more general situation as follows: for an admissible set $\Pi \subseteq \mathcal{M}(\mathcal{A})$ of priors, and
a set $\mathfrak{D}$ of observation maps, let $\Pi \odot \mathfrak{D}$ be the set of probability distributions $\pi \odot \mathbb{D}$ on
$\mathcal{A} \times \mathcal{D}$ generated by $\pi \in \Pi$ and $\mathbb{D} \in \mathfrak{D}$:

$$\Pi \odot \mathfrak{D} := \{\pi \odot \mathbb{D} \mid \pi \in \Pi,\ \mathbb{D} \in \mathfrak{D}\}. \tag{5.1}$$

As shown in Section 3, directly conditioning measures $\pi \odot \mathbb{D}$ with respect to the random
variable $D$ representing the observed sample data would require manipulating regular
conditional probabilities on $\mathcal{A} \times \mathcal{D}$.

Furthermore, in Bayesian statistics a prior $\pi$ may represent a "subjective belief"
about reality and, in such situations, the data may be sampled from $\pi^\dagger \cdot \mathbb{D}^\dagger$, which may
be distinct from $\pi \cdot \mathbb{D}$. In Bayesian statistics $\pi^\dagger$ is called the "true" (or sometimes
"objective") prior and $\pi$ a "subjective" prior (see [: ] and references therein). Although
it is known that the subjective prior $\pi$ might be distinct from the true prior $\pi^\dagger$, one may
still try to evaluate the conditional expectation of the quantity of interest $\Phi$ using $\pi$ as
the distribution on $\mathcal{A}$. We will show here that although the observation of the sample
data $d$ does not uniquely determine the true prior $\pi^\dagger$ and the true data map $\mathbb{D}^\dagger$, it does
determine a random subset of $\mathcal{M}(\mathcal{A}) \times \mathfrak{D}$ (i.e. a random subset of priors and data maps)
denoted $\mathcal{R}(d)$ such that, with $\pi^\dagger \cdot \mathbb{D}^\dagger$ probability one, $(\pi^\dagger, \mathbb{D}^\dagger) \in \mathcal{R}(d)$. This observation
is based on the following fundamental lemma:



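Lemma 5.1. For a strongly Lindelöf space $\mathcal{Y}$ and a Borel measure $\nu$ on $\mathcal{B}(\mathcal{Y})$, define

$$E := \{y \in \mathcal{Y} \mid \text{there is an open neighborhood } O_y \text{ of } y \text{ such that } \nu(O_y) = 0\}.$$

Then $\nu(E) = 0$.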



Remark 5.2. Recall that a Lindelöf space is a topological space such that any open
cover has a countable subcover and a strongly Lindelöf space is such that any open
subset is Lindelöf. Since $\mathcal{D}$ is assumed to be Suslin from Section 3.3, and Suslin implies
strongly Lindelöf, Lemma 5.1 shows that any open neighborhood $B_d$ of any observed
value $d \in \mathcal{D}$ has nonzero measure with probability 1.

Remark 5.3. Any separable Hilbert space, in particular any finite-dimensional Euclidean
space, is strongly Lindelöf. In this situation, Lemma 5.1 implies that if for any observation $y$
generated by a law $\nu \in \mathcal{M}(\mathcal{Y})$ we place an open ball $B(y, r(y))$ of non-zero radius
$r(y) > 0$ about $y$, then with $\nu$-probability 1 we have $\nu(B(y, r(y))) > 0$. That is,

$$\nu(\{y \in \mathcal{Y} \mid \nu(B(y, r(y))) > 0\}) = 1.$$



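Now suppose the data $d$ are generated according to a probability measure $\pi^\dagger \cdot \mathbb{D}^\dagger$
(where $\pi^\dagger$ is the "true" prior and $\mathbb{D}^\dagger$ the "true" data map). We conclude from Lemma 5.1
that when we observe a sample $d$, if we assume that $(\pi^\dagger, \mathbb{D}^\dagger) \in \mathcal{R}(d)$ where

$$\mathcal{R}(d) := \{(\pi, \mathbb{D}) \in \mathcal{M}(\mathcal{A}) \times \mathfrak{D} \mid \pi \cdot \mathbb{D}[B] > 0 \text{ for all } B \text{ open containing } d\},$$

then we will be correct in this assumption with $\pi^\dagger \cdot \mathbb{D}^\dagger$-probability 1. Therefore, when the
data $d$ are generated and we observe that $d \in B_d$ where $B_d$ is an open subset containing
the data $d$ (to keep our notation simple, we will, later on, drop $d$ in the notation $B_d$), then
we restrict our attention to priors $\pi \in \Pi$ and data maps $\mathbb{D} \in \mathfrak{D}$ such that $\pi \cdot \mathbb{D}[B_d] > 0$.
That is to say, we restrict our attention to the intersection of $\Pi \odot \mathfrak{D}$ with the set of
measures $\pi \odot \mathbb{D}$ such that $(\pi, \mathbb{D}) \in \mathcal{M}(\mathcal{A}) \times \mathfrak{D}$ and $\pi \cdot \mathbb{D}[B_d] > 0$. We write $\Pi \odot_{B_d} \mathfrak{D}$ for
this intersection, i.e.

$$\Pi \odot_{B_d} \mathfrak{D} := \{\pi \odot \mathbb{D} \mid (\pi, \mathbb{D}) \in \Pi \times \mathfrak{D} \text{ and } \pi \cdot \mathbb{D}[B_d] > 0\}.$$

If $\Pi \odot_{B_d} \mathfrak{D}$ is empty, then we assert that "$\pi^\dagger \odot \mathbb{D}^\dagger$ is not contained in $\Pi \odot \mathfrak{D}$" and we
know that this assertion is true with $\pi^\dagger \cdot \mathbb{D}^\dagger$-probability 1 on the realization of the data
$d$. Conversely, if $\pi^\dagger \odot \mathbb{D}^\dagger$ is contained in $\Pi \odot \mathfrak{D}$, then $\Pi \odot_{B_d} \mathfrak{D}$ must, with
$\pi^\dagger \cdot \mathbb{D}^\dagger$-probability 1 on the realization of the data $d$, still contain $\pi^\dagger \odot \mathbb{D}^\dagger$ (in particular it must
be non-empty).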

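Happily, this approach also facilitates the efficient computation of the conditional
expectations because now they have a simple representation. Indeed, consider the
conditional expectation of an object of interest $\Phi$ given a prior $\pi$ and data map $\mathbb{D}$, conditioned
on a subset $B \in \mathcal{B}(\mathcal{D})$ such that $\pi \cdot \mathbb{D}[B] > 0$. It follows from (3.4) and (3.5) that the
conditional expectation of $\Phi$ given $B$ is

$$\mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B] := \frac{\mathbb{E}_{(f,\mu,d) \sim \pi \odot \mathbb{D}}\big[\Phi(f,\mu)\,\mathbb{1}_B(d)\big]}{\pi \cdot \mathbb{D}[B]},$$

which, using (3.4) and (3.5), leads to

$$\mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B] = \frac{\mathbb{E}_{(f,\mu) \sim \pi}\big[\Phi(f,\mu)\,\mathbb{D}(f,\mu)[B]\big]}{\mathbb{E}_{(f,\mu) \sim \pi}\big[\mathbb{D}(f,\mu)[B]\big]}. \tag{5.2}$$

Moreover, recall that this conditional expectation is the best mean squared approximation
of $\Phi$ under the measure $\pi \odot \mathbb{D}$, given the information that $D \in B$, i.e.

$$\mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B] = \operatorname*{arg\,min}_{m \in \mathbb{R}} \mathbb{E}_{\pi \odot \mathbb{D}}\big[(\Phi - m)^2 \,\big|\, B\big]. \tag{5.3}$$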
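
For a prior supported on finitely many scenarios, the ratio formula (5.2) reduces to a weighted average with likelihood weights. The following is a minimal sketch under that finite-support assumption; the function name `posterior_value` and the numerical inputs are hypothetical.

```python
def posterior_value(weights, phi, likelihood):
    """Evaluate (5.2) for a finitely supported prior.

    weights[i]    : prior mass on scenario i
    phi[i]        : quantity of interest Phi at scenario i
    likelihood[i] : D(f_i, mu_i)[B], the probability of observing B under scenario i
    """
    numerator = sum(w * p * l for w, p, l in zip(weights, phi, likelihood))
    denominator = sum(w * l for w, l in zip(weights, likelihood))
    if denominator <= 0.0:
        # Conditioning on B is only defined when pi . D[B] > 0.
        raise ValueError("pi . D[B] must be strictly positive")
    return numerator / denominator

value = posterior_value(weights=[0.5, 0.5], phi=[0.0, 1.0], likelihood=[0.2, 0.6])
```

Here the second scenario explains the data three times better than the first, so the posterior value moves from the prior value 0.5 to 0.75.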

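Consequently, for any open subset $B \subseteq \mathcal{D}$, we define

$$\Pi \odot_B \mathfrak{D} := \{\pi \odot \mathbb{D} \in \Pi \odot \mathfrak{D} \mid (\pi \cdot \mathbb{D})[B] > 0\}, \tag{5.4}$$

where, by (3.5),

$$\pi \cdot \mathbb{D}[B] := \mathbb{E}_{(f,\mu) \sim \pi}\big[\mathbb{D}(f,\mu)[B]\big]. \tag{5.5}$$

Then, since $(\pi \cdot \mathbb{D})[B] > 0$, the formula (5.2) for conditional expectation implies that
the following bounds are well defined:

$$\mathcal{U}(\Pi \odot_B \mathfrak{D}) := \sup_{\pi \odot \mathbb{D} \in \Pi \odot_B \mathfrak{D}} \mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B], \tag{5.6}$$

$$\mathcal{L}(\Pi \odot_B \mathfrak{D}) := \inf_{\pi \odot \mathbb{D} \in \Pi \odot_B \mathfrak{D}} \mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B], \tag{5.7}$$

where the conditional expectations are given by the ratio formula (5.2).

Finally, if $B$ is an open neighborhood containing the sample data $d$, then it follows that
$\mathcal{U}(\Pi \odot_B \mathfrak{D})$ and $\mathcal{L}(\Pi \odot_B \mathfrak{D})$ are optimal upper and lower bounds on the posterior
values $\mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B]$, given the observation $D \in B$, over all $\pi \in \Pi$ and $\mathbb{D} \in \mathfrak{D}$ such that
$\pi \cdot \mathbb{D}[B] > 0$.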

Example 5.4. When $\Phi$ is the indicator function of the set $\{(f,\mu) \mid \mu[f \geq a] \geq \epsilon\}$ (i.e.
the set of unsafe $(f,\mu)$), $\mathcal{U}(\Pi \odot_B \mathfrak{D})$ and $\mathcal{L}(\Pi \odot_B \mathfrak{D})$ are optimal upper and lower bounds
on the "posterior probability" that the system is unsafe given the observation $D \in B$
(and the sets $\Pi$ and $\mathfrak{D}$ of priors and observation maps respectively).

5.1 General information barriers on posterior values 

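Now let $B \subseteq \mathcal{D}$ be open and let

$$\mathcal{A}_{\Pi \odot_B \mathfrak{D}} := \Big\{(f,\mu) \in \mathcal{A} \,\Big|\, \delta_{(f,\mu)} \in \Pi \text{ and } \sup_{\mathbb{D} \in \mathfrak{D}} \mathbb{D}(f,\mu)[B] > 0\Big\}, \tag{5.9}$$

$$\mathcal{U}(\mathcal{A}_{\Pi \odot_B \mathfrak{D}}) := \sup_{(f,\mu) \in \mathcal{A}_{\Pi \odot_B \mathfrak{D}}} \Phi(f,\mu),$$

and use $\mathcal{L}$ for the corresponding infimum. The following theorem is a straightforward
consequence of (5.2):

Theorem 5.5. It holds true that

$$\mathcal{U}(\mathcal{A}_{\Pi \odot_B \mathfrak{D}}) \leq \mathcal{U}(\Pi \odot_B \mathfrak{D}) \leq \mathcal{U}(\mathcal{A}),$$

and

$$\mathcal{L}(\mathcal{A}) \leq \mathcal{L}(\Pi \odot_B \mathfrak{D}) \leq \mathcal{L}(\mathcal{A}_{\Pi \odot_B \mathfrak{D}}).$$

Moreover, if $\mathcal{A}_{\Pi \odot_B \mathfrak{D}}$ is non-empty, then

$$\mathcal{L}(\mathcal{A}) \leq \mathcal{L}(\Pi \odot_B \mathfrak{D}) \leq \mathcal{L}(\mathcal{A}_{\Pi \odot_B \mathfrak{D}}) \leq \mathcal{U}(\mathcal{A}_{\Pi \odot_B \mathfrak{D}}) \leq \mathcal{U}(\Pi \odot_B \mathfrak{D}) \leq \mathcal{U}(\mathcal{A}).$$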

Remark 5.6. The dependence of $\mathcal{U}(\mathcal{A}_{\Pi \odot_B \mathfrak{D}})$ and $\mathcal{L}(\mathcal{A}_{\Pi \odot_B \mathfrak{D}})$ on the sample data is very
weak. In particular, if $\mathfrak{D} = \{\mathbb{D}\}$ and $\mathbb{D}$ corresponds to observing i.i.d. realizations of
$(X + \xi, f^\dagger(X) + \xi')$ where $\xi$ and $\xi'$ are centered Gaussian random variables of arbitrarily small
(non-zero) variance, then it can be shown that $\mathcal{U}(\mathcal{A}_{\Pi \odot_B \mathfrak{D}}) = \mathcal{U}(\mathcal{A}_\Pi)$ and
$\mathcal{L}(\mathcal{A}_{\Pi \odot_B \mathfrak{D}}) = \mathcal{L}(\mathcal{A}_\Pi)$. In that situation, if $\mathcal{L}(\mathcal{A}_\Pi) < \mathcal{U}(\mathcal{A}_\Pi)$, then
$\mathcal{U}(\mathcal{A}_{\Pi \odot_B \mathfrak{D}}) - \mathcal{L}(\mathcal{A}_{\Pi \odot_B \mathfrak{D}})$ remains
bounded away from 0 by a strictly positive constant that is independent of $\mathfrak{D}$ and $B$,
which, in particular (by Theorem 5.5), implies that the range of achievable posterior values cannot shrink
towards $\Phi(f^\dagger, \mu^\dagger)$ regardless of the number of observed i.i.d. samples. The presence of
such information barriers suggests that the consistency of Bayesian estimators cannot
be established independently (uniformly) in the choice of priors (this point will also be
substantiated by Theorem 5.12).

5.2 Primary reduction for posterior values 

As in Section 4.2.1, when priors are specified through finite-dimensional inequalities, it
is possible to provide a reduction of the computation of $\mathcal{U}(\Pi \odot_B \mathfrak{D})$ on the primary
space. To that end, let $\mathcal{M}_+(\mathcal{A})$ denote the set of positive bounded measures on $\mathcal{A}$ and
let us extend the "expectation notation" to mean integration with respect to a positive
measure in the natural way: for a measurable function $\psi$ and a $\pi_+ \in \mathcal{M}_+(\mathcal{A})$ define

$$\mathbb{E}_{\pi_+}[\psi] := \int_{\mathcal{A}} \psi \, d\pi_+$$

if the integral exists.

Let $\psi_0, \ldots, \psi_n$ be real-valued measurable functions on $\mathcal{A}$ and define

$$\Pi_+ := \{\pi_+ \in \mathcal{M}_+(\mathcal{A}) \mid \mathbb{E}_{\pi_+}[\psi_0] = 1, \text{ and } \mathbb{E}_{\pi_+}[\psi_i] = 0 \text{ for } i = 1, \ldots, n\},$$

where implicit in the definition is that all $n + 1$ integrals exist, and let

$$\Pi_{+,n} := \Pi_+ \cap \Delta(n)$$

be the set of those measures in $\Pi_+$ that are non-negative sums of $n + 1$ Dirac masses.
The following theorem is a generalization of [86, Thm. 4.1] to positive measures (see also
[121, Thm. 3.2] from which the proof of [86, Thm. 4.1] was derived).

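Theorem 5.7. If $\mathcal{A}$ is a Suslin space, then

$$\sup_{\pi_+ \in \Pi_+} \mathbb{E}_{\pi_+}[\Phi] = \sup_{\pi_+ \in \Pi_{+,n+1}} \mathbb{E}_{\pi_+}[\Phi]. \tag{5.10}$$

Furthermore, if $\psi_0$ is non-negative on $\mathcal{A}$ and there exists a measurable function $\varphi$ such
that $\Phi = \psi_0 \varphi$, then

$$\sup_{\pi_+ \in \Pi_+} \mathbb{E}_{\pi_+}[\Phi] = \sup_{\pi_+ \in \Pi_{+,n}} \mathbb{E}_{\pi_+}[\Phi]. \tag{5.11}$$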

Theorem 5.7 can be used to produce a primary reduction of $\mathcal{U}(\Pi \odot_B \mathfrak{D})$ when $\Pi$ is
defined by a finite number of equalities. To state the theorem, recall that, for arbitrary
$\Pi$, $\mathfrak{D}$ and $B$, the definition

$$\Pi \odot_B \mathfrak{D} := \{\pi \odot \mathbb{D} \in \Pi \odot \mathfrak{D} \mid \pi \cdot \mathbb{D}[B] > 0\}$$

of (5.4), where by (5.5)

$$\pi \cdot \mathbb{D}[B] := \mathbb{E}_{(f,\mu) \sim \pi}\big[\mathbb{D}(f,\mu)[B]\big];$$

recall also the notation of (5.6)

$$\mathcal{U}(\Pi \odot_B \mathfrak{D}) := \sup_{\pi \odot \mathbb{D} \in \Pi \odot_B \mathfrak{D}} \mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B];$$

and recall the result (5.2) that, for any $\pi \odot \mathbb{D} \in \Pi \odot_B \mathfrak{D}$,

$$\mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B] = \frac{\mathbb{E}_{(f,\mu) \sim \pi}\big[\Phi(f,\mu)\,\mathbb{D}(f,\mu)[B]\big]}{\mathbb{E}_{(f,\mu) \sim \pi}\big[\mathbb{D}(f,\mu)[B]\big]}.$$

The proof of the following theorem is obtained by first proving the theorem for equality
constraints $Z = \{q\}$, by observing that $\mathcal{U}(\Pi(q) \odot_B \mathfrak{D})$ is a fractional optimization problem
in $\pi$ and utilizing the fact that such problems are equivalent to linear problems [^>1], and
then applying Theorem 5.7. To extend the result to the subset $Z \subseteq \mathbb{R}^n$, one uses
a layercake approach as in the proof of Theorem 4.6. As in Section 4, the following
primary reduction theorem, Theorem 5.8, will be formulated in canonical form and the
nested reduction theorem, Theorem 5.10, will be in the general form.

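Theorem 5.8. Let $\mathcal{A}$ be Suslin and let $\Psi \colon \mathcal{A} \to \mathbb{R}^n$ be measurable. For $Z \subseteq \mathbb{R}^n$ let
$\Pi(Z) := \{\pi \in \mathcal{M}(\mathcal{A}) \mid \mathbb{E}_\pi[\Psi] \in Z\}$. Then $\mathcal{U}(\Pi(Z) \odot_B \mathfrak{D})$ is equal to the supremum over
$\mathbb{D} \in \mathfrak{D}$, $\alpha_i \geq 0$, $q \in Z$ and $(f_i, \mu_i) \in \mathcal{A}$ of

$$\sum_{i=0}^{n} \alpha_i \Phi(f_i, \mu_i)\, \mathbb{D}(f_i, \mu_i)[B]$$

subject to the constraints

$$\sum_{i=0}^{n} \alpha_i \big(\Psi(f_i, \mu_i) - q\big) = 0$$

and

$$\sum_{i=0}^{n} \alpha_i \mathbb{D}(f_i, \mu_i)[B] = 1. \tag{5.12}$$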

Example 5.9. Consider again Example 4.7 with the admissible set $\mathcal{A} := \mathcal{M}([0,1])$, the
quantity of interest $\Phi(\mu) := \mu[X \geq a]$, the map $\Psi(\mu) := \mathbb{E}_\mu[X]$ and the set of admissible
priors

$$\Pi := \{\pi \in \mathcal{M}(\mathcal{A}) \mid \mathbb{E}_{\mu \sim \pi}[\mathbb{E}_\mu[X]] = q\}$$

for some $q \in (0,a)$. We saw in Example 4.7 that $\mathcal{U}(\Pi) = q/a$. Now suppose that we
observe the random variable $D := (X_1, \ldots, X_n)$ corresponding to $n$ i.i.d. samples of
$\mu^\dagger \in \mathcal{A}$. More precisely, we observe $D \in B$ where $B = B_1 \times \cdots \times B_n$ and $B_i$ is the ball in
$(0,1)$ of center $x_i$ and radius $\rho$, $x_i \in (0,1)$ and $0 < \rho \ll 1/n$. Let $\mathbb{D}^n$ denote the data
map corresponding to taking $n$ i.i.d. samples, that is, $\mathbb{D}^n(\mu) := \mu \otimes \cdots \otimes \mu$, and observe
that Theorem 5.8 implies that $\mathcal{U}(\Pi \odot_B \mathbb{D}^n)$ is equal to the supremum over $\alpha_1, \alpha_2 \geq 0$,
$\mu_1, \mu_2 \in \mathcal{A}$ of

$$\alpha_1 \mu_1[X \geq a]\, \mathbb{D}^n(\mu_1)[B] + \alpha_2 \mu_2[X \geq a]\, \mathbb{D}^n(\mu_2)[B]$$

subject to the constraints

$$\alpha_1 (\mathbb{E}_{\mu_1}[X] - q) + \alpha_2 (\mathbb{E}_{\mu_2}[X] - q) = 0,$$

$$\alpha_1 \mathbb{D}^n(\mu_1)[B] + \alpha_2 \mathbb{D}^n(\mu_2)[B] = 1,$$

with $\mathbb{D}^n(\mu)[B] = \prod_{i=1}^n \mu(B_i)$. Introducing slack variables $\beta_{1,i} := \mu_1[B_i]$ and
$\beta_{2,i} := \mu_2[B_i]$ as $n$ linear constraints on $\mu_1$ and $n$ linear constraints on $\mu_2$ we obtain (from
[86, Thm. 4.1]) that the supremum can be achieved by assuming that each $\mu_j$ is the
weighted sum of at most $n + 2$ Dirac masses. Assuming that the $B_i$ are non-intersecting
balls of radius $\rho \ll 1/n$ centered on $x_1, \ldots, x_n$, $n$ of these Dirac masses will have to be
put at $x_1, \ldots, x_n$; for optimality, the two others will have to be put at $0$ and $a$ (with
weights $p_1$ and $p_2$). Introducing $\gamma_1 = \alpha_1 \mathbb{D}^n(\mu_1)[B]$ and $\gamma_2 = \alpha_2 \mathbb{D}^n(\mu_2)[B]$, it follows
that $\mathcal{U}(\Pi \odot_B \mathbb{D}^n)$ is equal (as $\rho \downarrow 0$) to the supremum over $\gamma_1, \gamma_2 \geq 0$, $p_1, p_2 \in [0,1]$ of

$$\gamma_1 p_1 + \gamma_2 p_2$$

subject to the constraints

$$\gamma_1 + \gamma_2 = 1,$$

and

$$\gamma_1 \frac{\big(a p_1 + \sum_{i=1}^n x_i \beta_{1,i}\big) - q}{\prod_{i=1}^n \beta_{1,i}} + \gamma_2 \frac{\big(a p_2 + \sum_{i=1}^n x_i \beta_{2,i}\big) - q}{\prod_{i=1}^n \beta_{2,i}} = 0.$$

By considering $0 < \beta_{j,i} \ll 1$ it is easy to obtain that $\mathcal{U}(\Pi \odot_B \mathbb{D}^n) = 1$.
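
To see numerically why the posterior bound in Example 5.9 can reach 1, one can build a single admissible two-atom prior by hand. The sketch below is not the optimization of Theorem 5.8; it is one hypothetical construction in the idealized limit $\rho \downarrow 0$, where each ball $B_i$ only sees an atom placed exactly at $x_i$. The prior puts weight $w$ on a measure $\mu_1$ (mass $\beta$ on each data point, the rest at $a$) and weight $1 - w$ on $\mu_2 = \delta_0$, which assigns zero probability to the data. All names and numerical values are assumptions, and the data points are assumed distinct and different from $0$ and $a$.

```python
from math import prod

def mean(mu):
    """E_mu[X] for a discrete measure given as {location: weight}."""
    return sum(x * w for x, w in mu.items())

def prob_at_least(mu, a):
    """Phi(mu) = mu[X >= a]."""
    return sum(w for x, w in mu.items() if x >= a)

def likelihood(mu, xs):
    """D^n(mu)[B] = prod_i mu(B_i), with B_i shrunk to the point x_i (rho -> 0)."""
    return prod(mu.get(x, 0.0) for x in xs)

def brittle_prior(a, q, xs, beta):
    """Two-atom prior satisfying the mean constraint E_pi[E_mu[X]] = q."""
    mu1 = {x: beta for x in xs}
    mu1[a] = 1.0 - beta * len(xs)       # almost all mass sits at the threshold a
    mu2 = {0.0: 1.0}                    # Dirac at 0: cannot have produced the data
    w = q / mean(mu1)                   # weight chosen so the prior mean equals q
    return [(w, mu1), (1.0 - w, mu2)]

def prior_value(prior, a):
    return sum(w * prob_at_least(mu, a) for w, mu in prior)

def posterior_value(prior, a, xs):
    num = sum(w * prob_at_least(mu, a) * likelihood(mu, xs) for w, mu in prior)
    den = sum(w * likelihood(mu, xs) for w, mu in prior)
    return num / den

a, q, xs = 0.5, 0.1, [0.2, 0.7, 0.9]    # hypothetical threshold, prior mean, data
prior = brittle_prior(a, q, xs, beta=1e-3)
prior_mean = sum(w * mean(mu) for w, mu in prior)
before = prior_value(prior, a)
after = posterior_value(prior, a, xs)
```

With these inputs the prior value stays below $q/a = 0.2$, while the posterior value is $\Phi(\mu_1) = 0.999$: conditioning discards $\mu_2$ entirely, however small the likelihood $\beta^n$ of the data under $\mu_1$ is.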

5.3 Nested reduction for posterior values 

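Here, as in Section 4.2, we show how the optimization problems (5.6) and (5.7) can
be reduced to nested OUQ optimization problems (i.e. nested problems analogous to
(3.1) and (3.2)) when the collection $\Pi$ of admissible priors is defined by how they push
forward by a measurable mapping $\Psi \colon \mathcal{A} \to \mathcal{Q}$. That is, we specify a feature space $\mathcal{Q}$, a
measurable map $\Psi \colon \mathcal{A} \to \mathcal{Q}$, a subset $\mathfrak{Q} \subseteq \mathcal{M}(\mathcal{Q})$ and define the admissible set of priors
by

$$\Pi := \Psi^{-1}\mathfrak{Q} = \{\pi \in \mathcal{M}(\mathcal{A}) \mid \Psi\pi \in \mathfrak{Q}\}.$$

As before, we focus on reducing the upper bound

$$\mathcal{U}(\Psi^{-1}\mathfrak{Q} \odot_B \mathfrak{D}) := \sup_{\pi \odot \mathbb{D} \in \Psi^{-1}\mathfrak{Q} \odot_B \mathfrak{D}} \mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B]. \tag{5.13}$$

Theorem 5.10. Let $\mathcal{A}$ be a Suslin space, let $\mathcal{Q}$ be a separable and metrizable space, and
let $\Psi \colon \mathcal{A} \to \mathcal{Q}$ be measurable. Moreover, let $\mathfrak{Q} \subseteq \mathcal{M}(\mathcal{Q})$ be such that $\operatorname{supp}(\mathbb{Q}) \subseteq \Psi(\mathcal{A})$
for all $\mathbb{Q} \in \mathfrak{Q}$. Then, for each $\mathbb{Q} \in \mathfrak{Q}$, $\Psi^{-1}\mathbb{Q}$ is non-empty. Moreover, the upper bound
$\mathcal{U}(\Psi^{-1}\mathfrak{Q} \odot_B \mathfrak{D})$, defined in (5.13), satisfies

$$\mathcal{U}(\Psi^{-1}\mathfrak{Q} \odot_B \mathfrak{D}) = \sup\Big\{\lambda \in \mathbb{R} \,\Big|\, \sup_{\mathbb{Q} \in \mathfrak{Q},\, \mathbb{D} \in \mathfrak{D}} \mathbb{E}_{q \sim \mathbb{Q}}\Big[\sup_{(f,\mu) \in \Psi^{-1}(q)} \big(\Phi(f,\mu) - \lambda\big)\, \mathbb{D}(f,\mu)[B]\Big] > 0\Big\}, \tag{5.14}$$

where the expectations on the right-hand side are defined as in (4.17). Finally, the
expectation operator on the right-hand side is measure affine in $\mathbb{Q}$, as defined in (4.3).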

Remark 5.11. Note that Theorem 5.10 is more general than Theorem 5.8 because
its application does not require the assumption that $\Psi^{-1}\mathfrak{Q}$ is defined via generalized
moment constraints.

The following theorem is our Main Brittleness Theorem. It shows not only that the
right-hand side of the assertion (5.14) of Theorem 5.10 depends on the sample data in
a very weak way, but also that under very mild assumptions the observation of this
sample data leads to an increase (rather than a decrease) of the least upper bound on
the quantity of interest:

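Theorem 5.12. Let $\mathcal{A}$ be a Suslin space, let $\mathcal{Q}$ be a separable and metrizable space, and
let $\Psi \colon \mathcal{A} \to \mathcal{Q}$ be measurable. Moreover, let $\mathfrak{Q} \subseteq \mathcal{M}(\mathcal{Q})$ be such that $\operatorname{supp}(\mathbb{Q}) \subseteq \Psi(\mathcal{A})$
for all $\mathbb{Q} \in \mathfrak{Q}$. Suppose that, for all $\delta > 0$, there exists some $\mathbb{Q} \in \mathfrak{Q}$, $\mathbb{D} \in \mathfrak{D}$ such that

$$\mathbb{E}_{q \sim \mathbb{Q}}\Big[\inf_{(f,\mu) \in \Psi^{-1}(q)} \mathbb{D}(f,\mu)[B]\Big] = 0 \tag{5.15}$$

and

$$\mathbb{Q}\Big[\sup_{(f,\mu) \in \Psi^{-1}(q) \,:\, \mathbb{D}(f,\mu)[B] > 0} \Phi(f,\mu) > \sup_{(f,\mu) \in \mathcal{A}} \Phi(f,\mu) - \delta\Big] > 0. \tag{5.16}$$

Then

$$\mathcal{U}(\Psi^{-1}(\mathfrak{Q}) \odot_B \mathfrak{D}) = \mathcal{U}(\mathcal{A}). \tag{5.17}$$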

Remark 5.13. Note that the convention that $\sup_{x \in A} \psi(x) = -\infty$ if $A$ is empty implies
that, if the assumption (5.16) is satisfied, then there is a measure $\mathbb{Q} \in \mathfrak{Q}$ such that the set
of $q$ such that $\mathbb{D}(f,\mu)[B] > 0$ for some $(f,\mu) \in \Psi^{-1}(q)$ has strictly positive $\mathbb{Q}$-measure.

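Remark 5.14. Theorem 5.12 states that if there exists $\mathbb{Q} \in \mathfrak{Q}$ putting some mass on a
neighborhood of the values $q$ of $\Psi$ where $\sup_{(f,\mu) \in \Psi^{-1}(q)} \Phi(f,\mu)$ achieves its supremum,
then

$$\mathcal{U}(\Psi^{-1}(\mathfrak{Q}) \odot_B \mathfrak{D}) = \mathcal{U}(\mathcal{A}).$$

On the other hand, Theorem 4.2 asserts that

$$\mathcal{U}(\Psi^{-1}\mathfrak{Q}) \leq \mathcal{U}(\mathcal{A}), \tag{5.18}$$

so we conclude that

$$\mathcal{U}(\Psi^{-1}\mathfrak{Q}) \leq \mathcal{U}(\Psi^{-1}(\mathfrak{Q}) \odot_B \mathfrak{D}). \tag{5.19}$$

That is, observing the sample data does not improve the optimal bound! Moreover, when
the inequality (5.18) is strict, if we define

$$\delta := \mathcal{U}(\mathcal{A}) - \mathcal{U}(\Psi^{-1}\mathfrak{Q}) > 0$$

then it follows that

$$\mathcal{U}(\Psi^{-1}\mathfrak{Q}) + \delta \leq \mathcal{U}(\Psi^{-1}(\mathfrak{Q}) \odot_B \mathfrak{D}), \tag{5.20}$$

from which we conclude that when the inequality (5.18) is strict, observing the sample
data makes the optimal bound worse! In other words, after the observation of the sample
data (which may be limited to a single realization of $(X, f^\dagger(X))$ under the measure $\mu^\dagger$,
or an arbitrarily large number of independent samples of $(X_i, f^\dagger(X_i))$), the optimal upper
bound on the quantity of interest

$$\mathcal{U}(\Psi^{-1}\mathfrak{Q}) = \sup_{\pi \in \Psi^{-1}\mathfrak{Q}} \mathbb{E}_{(f,\mu) \sim \pi}[\Phi(f,\mu)]$$

increases to

$$\mathcal{U}(\mathcal{A}) = \sup_{(f,\mu) \in \mathcal{A}} \Phi(f,\mu).$$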

Example 5.15. Consider $\mathcal{A} := \mathcal{M}([0,1])$, $\Phi(\mu) = \mathbb{E}_\mu[X]$, $\mathbb{D}^n(\mu) := \mu \otimes \cdots \otimes \mu$. In
this example we are interested in estimating the mean of $X$ under some unknown measure
$\mu^\dagger \in \mathcal{A}$ and we observe $d = (d_1, \ldots, d_n)$, $n$ i.i.d. samples from $X$; note that $n$ can
be very large. The sample data contain information on $\mu^\dagger$ through the fact that their
distribution is $\mathbb{D}^n(\mu^\dagger) = \mu^\dagger \otimes \cdots \otimes \mu^\dagger$ (i.e. although the distribution of the sample data
is unknown, its dependency structure, as a functional of $\mu^\dagger$, is known).

Let $k$ be a (possibly large) number. Define $\Pi$ to be the set of priors $\pi$ under which
the distribution of $(\mathbb{E}_\mu[X], \ldots, \mathbb{E}_\mu[X^k])$ is $\mathbb{Q}$, where $\mathbb{Q}$ is a distribution on $\mathbb{R}^k$ such that
$\mathbb{E}_\mu[X]$ (its first marginal) is uniformly distributed on $[0,1]$ and such that the (conditional)
distribution of $\mathbb{E}_\mu[X^2]$ conditioned on $\mathbb{E}_\mu[X] = q_1$ is the uniform distribution on the
interval

$$\Big[\inf_{\mu \in \mathcal{A},\, \mathbb{E}_\mu[X] = q_1} \mathbb{E}_\mu[X^2],\ \sup_{\mu \in \mathcal{A},\, \mathbb{E}_\mu[X] = q_1} \mathbb{E}_\mu[X^2]\Big]$$

and such that the conditional distributions of the other marginals $\mathbb{E}_\mu[X^j]$ are defined
iteratively in the same manner. For this example, note that $\Psi(\mu) = (\mathbb{E}_\mu[X], \ldots, \mathbb{E}_\mu[X^k])$.
Note that, for $q := (q_1, \ldots, q_k)$ in the range of $\Psi$ (i.e. $\Psi(\mathcal{A})$), $\Psi^{-1}(q)$ is the subset of
measures $\mu \in \mathcal{M}([0,1])$ such that $\mathbb{E}_\mu[X^i] = q_i$ for $1 \leq i \leq k$. Let $B$ be defined as
$B_1 \times \cdots \times B_n$ where each $B_i$ is a ball of radius $\rho$ containing $d_i$.
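
For $k = 2$ the measure $\mathbb{Q}$ of this example can be sampled directly, because the range of the second moment given the first is explicit: for a probability measure on $[0,1]$ with mean $q_1$, Jensen's inequality gives $\mathbb{E}_\mu[X^2] \geq q_1^2$ (attained by $\delta_{q_1}$) and $x^2 \leq x$ gives $\mathbb{E}_\mu[X^2] \leq q_1$ (attained by $q_1\delta_1 + (1-q_1)\delta_0$). The following sketch covers only this $k = 2$ case; the function name and sample size are illustrative assumptions.

```python
import random

def sample_first_two_moments(rng):
    """Draw (q1, q2) ~ Q for k = 2: q1 uniform on [0, 1], then q2 | q1 uniform on [q1^2, q1]."""
    q1 = rng.random()
    q2 = rng.uniform(q1 * q1, q1)
    return q1, q2

rng = random.Random(0)
draws = [sample_first_two_moments(rng) for _ in range(10_000)]
mean_q1 = sum(q1 for q1, _ in draws) / len(draws)
```

The empirical mean of the first coordinate approximates the prior value of $\mathbb{E}_\mu[X]$ under any prior in $\Pi$, namely the mean of the uniform distribution on $[0,1]$.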
We will now use Theorem 5.12 to compute optimal bounds on the posterior values
of $\Phi(\mu) = \mathbb{E}_\mu[X]$. We will focus our attention on the upper bound. First observe that in
this example $\mathfrak{Q}$ is reduced to the single measure $\mathbb{Q}$ constructed above and $\mathfrak{D}$ is reduced
to the single data map $\mathbb{D}^n$.

Let us first check that condition (5.16) is always satisfied (irrespective of the value
of the data $d$). Note that condition (5.16) is satisfied if for all $\delta > 0$ there exists a
subset of values of $q$ of strictly positive $\mathbb{Q}$-measure such that
$\{\mu \in \Psi^{-1}(q) \mid \mathbb{D}^n(\mu)[B] > 0 \text{ and } \mathbb{E}_\mu[X] > 1 - \delta\}$ is non-empty. So, let $\delta > 0$ be
arbitrary and define $\mu_d$ to be the empirical distribution of $d$, i.e.

$$\mu_d := \frac{1}{n} \sum_{i=1}^{n} \delta_{d_i}.$$

Define

$$\mathcal{A}_\delta := \{\mu \in \mathcal{A} \mid \mathbb{E}_\mu[X] > 1 - \delta/2\}.$$

One can show by induction that $\Psi(\mathcal{A}_\delta)$ has a non-empty interior and that any open subset
of $\Psi(\mathcal{A})$ has strictly positive $\mathbb{Q}$-measure. Let $q^*$ be a point in the interior of $\Psi(\mathcal{A}_\delta)$, and
let $B(q^*, \tau)$ be a ball of center $q^*$ and radius $\tau$ such that $B(q^*, 2\tau)$ is contained in the
interior of $\Psi(\mathcal{A}_\delta)$. Note that $B(q^*, \tau)$ has strictly positive $\mathbb{Q}$-measure. Furthermore, for
$\epsilon$ sufficiently small, for each $q \in B(q^*, \tau)$ there exists $q' \in B(q^*, 2\tau)$ and $\mu \in \Psi^{-1}(q')$
such that $\mu_\epsilon := (1 - \epsilon)\mu + \epsilon\mu_d \in \Psi^{-1}(q)$. Since $\mathbb{D}^n(\mu_\epsilon)[B] > 0$ and $\mathbb{E}_{\mu_\epsilon}[X] > 1 - \delta/2$, it
follows that (5.16) is satisfied (irrespective of the value of the data $d$).

Let us now consider condition (5.15). Observe that condition (5.15) is satisfied if for
$\mathbb{Q}$-almost all $q \in \Psi(\mathcal{A})$ and all $\epsilon > 0$, there exists $\mu \in \Psi^{-1}(q)$ such that $\mathbb{D}^n(\mu)[B] < \epsilon$.
Assume that $d$ contains at least $k + 2$ distinct points and that $\rho$ is strictly smaller than
half of the minimal distance between two of such points, so that the associated $B_i$ do
not overlap; note that this assumption is satisfied with probability converging to one (as
$n \to \infty$) if the data are sampled from a measure $\mu^\dagger$ that is absolutely continuous with
respect to the Lebesgue measure on $[0,1]$. Let $q \in \Psi(\mathcal{A})$; by the reduction theorems
of [86] there exists $\mu_q \in \Psi^{-1}(q)$ such that $\mu_q$ is the weighted sum of at most $k + 1$
Dirac masses on $[0,1]$. Since there exist at least $k + 2$ non-overlapping $B_i$ we have
$\mathbb{D}^n(\mu_q)[B] = 0$, which implies condition (5.15). Hence, Theorem 5.12 implies that, for
this (possibly) highly constrained problem characterized by a (possibly) large number of
sampled data points, the optimal bounds on the posterior values of $\mathbb{E}_\mu[X]$ are zero and
one whereas the set of prior values of $\mathbb{E}_\mu[X]$ is the single point $\{\tfrac{1}{2}\}$.
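
The counting argument behind condition (5.15) is a pigeonhole principle: a measure with at most $k + 1$ atoms cannot charge $k + 2$ pairwise disjoint balls, so one factor of the product likelihood vanishes. A small sketch of this step, with hypothetical data points, atoms and weights (here $k = 2$):

```python
def product_likelihood(atoms, weights, centers, rho):
    """D^n(mu)[B] = prod_i mu(B_i) for mu = sum_j weights[j] * delta_{atoms[j]}
    and open balls B_i = (centers[i] - rho, centers[i] + rho)."""
    result = 1.0
    for c in centers:
        mass = sum(w for x, w in zip(atoms, weights) if abs(x - c) < rho)
        result *= mass
    return result

k = 2
centers = [0.1, 0.35, 0.6, 0.85]                      # k + 2 distinct data points
rho = 0.05                                            # below half the minimal spacing
atoms, weights = [0.1, 0.35, 0.6], [0.2, 0.3, 0.5]    # a measure with k + 1 atoms
lik = product_likelihood(atoms, weights, centers, rho)
```

Even though three of the four balls are charged, the ball around the last data point receives no mass, so the likelihood of the whole sample is exactly zero.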

Remark 5.16. For a thorough analysis of Example 5.15 we refer to [N-^'i] where, in partic- 
ular, a quantitative version of Theorem 5.12 is developed and then applied to Example 
5.15. This application also leads to the discovery of a new family of Selberg integral 
formulas through a refined analysis of the integral geometry of the Hausdorff moment 
space through the revelation that the free parameter associated with Markov and Krein's
canonical representations of truncated Hausdorff moments generates reproducing kernel 
identities corresponding to reproducing kernel Hilbert spaces of polynomials. 

Remark 5.17. Note that the assumptions of Theorem 5.12 are extremely weak. In
plain words, Theorem 5.12 implies that if the probability of observing the data can be
arbitrarily small under admissible priors that put mass near the extreme
values of $\Phi$, then the optimal bounds on posterior values are the extreme values of $\Phi$ in
$\mathcal{A}$ (even if the data comes in the form of a large number of samples and the set of priors
is highly constrained). Example 5.15 illustrates that one consequence of Theorem 5.12 is
that Bayesian posteriors are not robust, and in fact are fragile with respect to the choices
of priors constrained by marginals, even with a highly constrained subset of priors of
$\mathcal{M}(\mathcal{A})$. Moreover, if $\Pi$ is convex, then by considering priors of the form $\lambda\pi_0 + (1 - \lambda)\pi_1$
with $\pi_0, \pi_1 \in \Pi$, $\pi_0 \cdot \mathbb{D}[B] > 0$ and $\pi_1 \cdot \mathbb{D}[B] > 0$, it is easy to see that the Bayesian
posterior can be "anything you want" in the interval $(\mathcal{L}(\mathcal{A}), \mathcal{U}(\mathcal{A}))$ (irrespective of the
data) in the sense that, for any value $l$ in that interval, there exists a prior $\pi \in \Psi^{-1}(\mathfrak{Q})$
whose posterior value is $l$. In addition, it is easy to observe that including the quantity
of interest $\Phi$ in the marginal $\Psi$ does not prevent this fragility. Theorem 5.12 also leads to
the following apparent paradoxes when the Bayesian framework is applied to the space
$\mathcal{A}$: (1) Posteriors with different priors may diverge as more and more data comes in;
(2) When the sample data is observed with some (say Gaussian) measurement noise
of variance $\sigma^2$, then, writing $\mathbb{D}(\sigma^2)$ for the associated measurement map, the optimal
bound $\mathcal{U}(\Psi^{-1}(\mathfrak{Q}) \odot_B \mathbb{D}(\sigma^2))$ on the quantity of interest $\Phi$ converges towards $\mathcal{U}(\Psi^{-1}(\mathfrak{Q}))$
as $\sigma^2 \to \infty$. That is, if one interprets optimal bounds on posterior values as uncertainty
bounds, then one would reach the paradoxical conclusion that adding measurement
uncertainty decreases the uncertainty of the quantity of interest. The idea of the proof
of this assertion is based on the following observation:

Let $y$ be the (noisy) measurement whose distribution given the value of the data $d$ is
assumed to be independent of $(f,\mu)$. Write $p_\sigma(d)[B]$ for the probability that the value
of $y$ belongs to a set $B$ and observe that the conditional value of the quantity of interest
$\Phi$ given $y \in B$ is equal to

$$\frac{\mathbb{E}_{(f,\mu) \sim \pi}\Big[\Phi(f,\mu)\, \mathbb{E}_{d \sim \mathbb{D}(f,\mu)}\big[p_\sigma(d)[B]\big]\Big]}{\mathbb{E}_{(f,\mu) \sim \pi}\Big[\mathbb{E}_{d \sim \mathbb{D}(f,\mu)}\big[p_\sigma(d)[B]\big]\Big]}. \tag{5.21}$$

We deduce that if $p_\sigma(d)[B]/p_\sigma(d')[B]$ converges towards one as the level of noise $\sigma \to \infty$,
uniformly over all pairs of data values $(d, d')$ (which is the case if the data in Example 5.9 is observed
with Gaussian noise of increasing variance, see also Example 5.18 below), then (5.21)
converges towards the prior value of $\Phi$ as $\sigma \to \infty$ uniformly in $\pi$.
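
The convergence of the ratio $p_\sigma(d)[B]/p_\sigma(d')[B]$ to one can be checked numerically in the simplest setting. The sketch below assumes scalar data, additive Gaussian noise $y = d + \mathcal{N}(0, \sigma^2)$ and an interval $B = (0.4, 0.6)$; these choices and the function names are illustrative assumptions rather than the setting of the paper.

```python
from math import erf, sqrt

def gaussian_interval_prob(lo, hi, d, sigma):
    """p_sigma(d)[B] for y = d + N(0, sigma^2) and B = (lo, hi)."""
    s = sigma * sqrt(2.0)
    return 0.5 * (erf((hi - d) / s) - erf((lo - d) / s))

def likelihood_ratio(lo, hi, d, d_prime, sigma):
    """Ratio p_sigma(d)[B] / p_sigma(d')[B] for two candidate data values."""
    return (gaussian_interval_prob(lo, hi, d, sigma)
            / gaussian_interval_prob(lo, hi, d_prime, sigma))

# d = 0.5 lies inside B, d' = 0.0 lies well outside it.
ratio_small_noise = likelihood_ratio(0.4, 0.6, 0.5, 0.0, sigma=0.1)
ratio_large_noise = likelihood_ratio(0.4, 0.6, 0.5, 0.0, sigma=100.0)
```

With small noise the observation discriminates sharply between the two data values (the ratio is in the tens of thousands), whereas with very large noise the ratio is within about $10^{-5}$ of one, so the weights in (5.21) become nearly independent of $(f,\mu)$ and the conditional value falls back to the prior value.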

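Example 5.18. Consider again Example 5.9 with the set of admissible priors $\pi$ on $\mathcal{A}$
defined as the collection

$$\Pi := \{\pi \in \mathcal{M}(\mathcal{A}) \mid \mathbb{E}_{\mu \sim \pi}[\mathbb{E}_\mu[X]] = q\}$$

and the map $\mathbb{D}^n$ corresponding to the observation of $n$ i.i.d. samples of $\mu$. For $q \in (0,a)$,
let $\mathfrak{Q}$ be the set of probability measures $\mathbb{Q}$ on $[0,1]$ such that $\mathbb{E}_{q' \sim \mathbb{Q}}[q'] = q$. Let $\mathbb{Q}$ be
the probability measure on $[0,1]$ with probability density function $p(x) = (1 - q)/q$ on
$[0,q]$ and $p(x) = q/(1 - q)$ on $(q,1]$. It is easy to check that $\mathbb{Q} \in \mathfrak{Q}$, that

$$\mathbb{E}_{q' \sim \mathbb{Q}}\Big[\inf_{\mu \in \mathcal{A} \,:\, \mathbb{E}_\mu[X] = q'} \prod_{i=1}^{n} \mu[B_i]\Big] = 0, \tag{5.22}$$

and that, for all $\delta > 0$,

$$\mathbb{Q}\Big[\sup_{\mu \in \mathcal{A} \,:\, \mathbb{E}_\mu[X] = q',\ \prod_{i=1}^{n} \mu[B_i] > 0} \Phi(\mu) > 1 - \delta\Big] > 0. \tag{5.23}$$

It follows from Theorem 5.12 that

$$\mathcal{U}(\Psi^{-1}(\mathfrak{Q}) \odot_B \mathbb{D}^n) = 1. \tag{5.24}$$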
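
The claim that the piecewise-constant density of this example defines an element of $\mathfrak{Q}$ amounts to two elementary integrals: total mass $\tfrac{1-q}{q} \cdot q + \tfrac{q}{1-q} \cdot (1-q) = 1$ and mean $\tfrac{1-q}{q} \cdot \tfrac{q^2}{2} + \tfrac{q}{1-q} \cdot \tfrac{1-q^2}{2} = q$. A short exact check in rational arithmetic (the function name and the test value $q = 1/5$ are illustrative assumptions):

```python
from fractions import Fraction

def mass_and_mean(q):
    """Exact total mass and mean of p(x) = (1-q)/q on [0, q] and q/(1-q) on (q, 1]."""
    low, high = (1 - q) / q, q / (1 - q)
    mass = low * q + high * (1 - q)                      # integral of p over [0, 1]
    mean = low * q**2 / 2 + high * (1 - q**2) / 2        # integral of x * p(x)
    return mass, mean

mass, mean = mass_and_mean(Fraction(1, 5))
```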

Remark 5.19. It is known from the Bernstein-von Mises theorem [24, 110] that, in 
finite-dimensional situations, posterior values converge towards the quantity of interest 
if the prior distribution has strictly positive mass in every neighbourhood of the truth 
(see also [76, 84]). It is also known that "even for the simplest infinite-dimensional 
models, the Bernstein-von Mises theorem does not hold" [37, 53]. This possible lack of 
convergence, referred to as the consistency problem, has been at the center of a debate 
between frequentists and Bayesians. We quote Diaconis and Freedman [12] (see also 
[4.3]) 

"If the underlying mechanism allows an infinite number of possible outcomes 
(e.g., estimation of an unknown probability on the integers), Bayes estimates 
can be inconsistent: as more and more data comes in, some Bayesian statis- 
ticians will become more and more convinced of the wrong answer." 

What is the significance of Theorem 5.12 in that discussion? To answer this question, consider Example 5.9 (and 5.18), in which one is interested in estimating the probability (under the unknown measure $\mu^\dagger$) that $X$ exceeds $a$ after observing $n$ independent samples. We already know from [42, 37] that placing priors on the infinite dimensional space $\mathcal{A} = \mathcal{M}([0,1])$ of probability measures on $[0,1]$ is unlikely to lead to Bayesian posteriors that will converge towards the true value as more and more data comes in. One strategy to circumvent this lack of convergence would be to consider a finite-dimensional subset of $\mathcal{A}$, i.e. a family $(\mu_\lambda)$ of probability measures on $[0,1]$ indexed by a finite-dimensional parameter $\lambda \in \mathbb{R}^k$, put a strictly positive prior $p$ on $\lambda \in \mathbb{R}^k$, and then invoke the Bernstein-von Mises theorem to guarantee the convergence of posterior values.

However, the Bernstein-von Mises theorem requires that the true distribution under which the data is sampled belongs to $\{\mu_\lambda \mid \lambda \in \mathbb{R}^k\}$, the parametrized finite-dimensional subset of $\mathcal{A}$. What happens when this is not the case, i.e. the situation of misspecification? Write $\pi_p$ for the push-forward of the prior $p$ on $\lambda \in \mathbb{R}^k$ to a prior on $\mathcal{A}$ under the map $\lambda \mapsto \mu_\lambda$. Assume that $\mathfrak{D} = \{\mathbb{D}\}$ and that the data have been sampled from $\pi^\dagger \cdot \mathbb{D}$, where $\pi^\dagger$ is the (frequentist) true distribution. Here Theorem 5.12, as illustrated in Example 5.15, can be used to show that the posterior values of the quantity of interest under $\pi_p$ and $\pi^\dagger$ may lie near the opposite extreme values of $\Phi$ in $\mathcal{A}$ even if (1) $\pi^\dagger$ is a Dirac mass on a measure $\mu^\dagger \in \mathcal{A}$; (2) the number of independent samples is large; and (3) $k$ is large and $k$ moments of $\mu^\dagger$ and $\mu_{\lambda^*}$ are equal for some $\lambda^* \in \mathbb{R}^k$.

5.4 Min-Max Bayesian posterior

Equation (5.3) implies that

\[
\mathcal{U}\big(\Psi^{-1}(\mathfrak{Q}) \odot_B \mathfrak{D}\big) = \sup_{\pi\odot\mathbb{D} \,\in\, \Psi^{-1}(\mathfrak{Q})\odot_B\mathfrak{D}} \ \arg\min_{m\in\mathbb{R}} \ \mathbb{E}_{\pi\odot\mathbb{D}}\big[(\Phi - m)^2 \,\big|\, B\big] \tag{5.25}
\]



and Theorem 5.12 shows that, under very general conditions, $\mathcal{U}\big(\Psi^{-1}(\mathfrak{Q}) \odot_B \mathfrak{D}\big) = \mathcal{U}(\mathcal{A})$, which implies the fragility of the Bayesian posterior with respect to the choice of the prior. It is natural to wonder whether this fragility can be remedied by using a min-max version of the conditional expectation, defined by switching the positions of the supremum in $\pi$ and the minimum in $m$ in (5.25). More precisely, using the notation of Section 5.3, we define the min-max conditional expectation as

\[
\arg\min_{m\in\mathbb{R}} \ \sup_{\pi\odot\mathbb{D} \,\in\, \Psi^{-1}(\mathfrak{Q})\odot_B\mathfrak{D}} \ \mathbb{E}_{\pi\odot\mathbb{D}}\big[(\Phi - m)^2 \,\big|\, B\big]. \tag{5.26}
\]



Although we have not found previous references to our version of a min-max Bayesian 
posterior value, our definition is motivated by: 

1. The observation that min-max definitions have previously been employed in mak- 
ing decisions or predictions under uncertainty. We refer, for instance to the intro- 
duction of the worst-case conditional Value-at-Risk and its application to robust 
portfolio management [123] and to robust min-max portfolio strategies for rival 
forecast and risk scenarios [93]. 

2. The question of whether such definition could resolve the lack of convergence of 
the posterior value. 

The following theorem shows that the answer to (2) is no and that, in particular, (5.26) is in general equal to the midpoint of the OUQ interval $[\mathcal{L}(\mathcal{A}), \mathcal{U}(\mathcal{A})]$, i.e.

\[
\frac{\mathcal{U}(\mathcal{A}) + \mathcal{L}(\mathcal{A})}{2}.
\]

Hence, the min-max conditional expectation cannot converge towards $\Phi(f^\dagger, \mu^\dagger)$.



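Theorem 5.20. Let $\mathcal{A}$ be a Suslin space, let $\mathcal{Q}$ be a separable and metrizable space, and let $\Psi : \mathcal{A} \to \mathcal{Q}$ be measurable. Moreover, let $\mathfrak{Q} \subseteq \mathcal{M}(\mathcal{Q})$ be such that $\operatorname{supp}(Q) \subseteq \Psi(\mathcal{A})$ for all $Q \in \mathfrak{Q}$. Suppose that, for all $\delta > 0$, there exist $Q \in \mathfrak{Q}$ and $\mathbb{D} \in \mathfrak{D}$ such that

\[
\mathbb{E}_{q\sim Q}\Big[ \inf_{(f,\mu)\in\Psi^{-1}(q)} \mathbb{D}(f,\mu)[B] \Big] = 0, \tag{5.27}
\]

\[
\mathbb{P}_{q\sim Q}\Big[ \sup_{(f,\mu)\in\Psi^{-1}(q),\ \mathbb{D}(f,\mu)[B]>0} \Phi(f,\mu) > \sup_{(f,\mu)\in\mathcal{A}} \Phi(f,\mu) - \delta \Big] > 0, \tag{5.28}
\]

and

\[
\mathbb{P}_{q\sim Q}\Big[ \inf_{(f,\mu)\in\Psi^{-1}(q),\ \mathbb{D}(f,\mu)[B]>0} \Phi(f,\mu) < \inf_{(f,\mu)\in\mathcal{A}} \Phi(f,\mu) + \delta \Big] > 0. \tag{5.29}
\]

Then

\[
\arg\min_{m\in\mathbb{R}} \ \sup_{\pi\odot\mathbb{D} \,\in\, \Psi^{-1}(\mathfrak{Q})\odot_B\mathfrak{D}} \ \mathbb{E}_{\pi\odot\mathbb{D}}\big[(\Phi - m)^2 \,\big|\, B\big] = \frac{\mathcal{U}(\mathcal{A}) + \mathcal{L}(\mathcal{A})}{2}.
\]

Figure 6.1: Illustration of Conditions (6.1) and (6.2) of Theorem 6.1. If, for some data map $\mathbb{D} \in \mathfrak{D}$, all level sets of $\Psi$ go to zero (i.e. for all $q \in \Psi(\mathcal{A})$, $\inf_{(f,\mu)\in\Psi^{-1}(q)} \mathbb{D}(f,\mu)[B] = 0$), then, for any positive section $\psi$ of $\Psi$ (i.e. $\Psi\circ\psi(q) = q$ and $\mathbb{D}(\psi(q))[B] > 0$ for $q \in \Psi(\mathcal{A})$), the least upper bound on posterior values is bounded from below by the essential supremum of $\Phi\circ\psi$.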
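The midpoint conclusion of Theorem 5.20 can be made plausible with a toy computation. Assume, as a simplification in the spirit of Theorem 5.12 and not as part of the proof, that the adversary may concentrate the posterior on any value in an interval `[lo, hi]` standing in for $[\mathcal{L}(\mathcal{A}), \mathcal{U}(\mathcal{A})]$. The worst-case squared loss of a guess $m$ is then attained at an endpoint, and the minimizing $m$ is the midpoint. The function names and the grid search are illustrative choices.

```python
def worst_case_loss(m, lo, hi):
    """sup over probability measures on [lo, hi] of E[(Phi - m)^2].

    The integrand is convex in Phi, so the supremum is attained by a point
    mass at one of the two endpoints.
    """
    return max((lo - m) ** 2, (hi - m) ** 2)


def minmax_estimate(lo, hi, grid=10001):
    """Minimise the worst-case loss over a uniform grid of candidate values m."""
    candidates = [lo + (hi - lo) * i / (grid - 1) for i in range(grid)]
    return min(candidates, key=lambda m: worst_case_loss(m, lo, hi))


if __name__ == "__main__":
    print(minmax_estimate(0.0, 1.0))    # close to the midpoint of [0, 1]
    print(minmax_estimate(-2.0, 6.0))   # close to the midpoint of [-2, 6]
```

In this simplified picture the min-max value depends only on the two extremes of the interval, which is why it carries no information about the truth.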

6 Brittleness under Local Misspecification 

We now establish a corollary to the proof of Theorem 5.12 which we will then use to establish an extreme brittleness theorem for a Bayesian model with local misspecification. Recall that, for a map $\Psi : \mathcal{A} \to \mathcal{Q}$, a map $\psi : \Psi(\mathcal{A}) \to \mathcal{A}$ is called a section of $\Psi$ if $\Psi\circ\psi(q) = q$ for all $q \in \Psi(\mathcal{A})$.

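Theorem 6.1. Let $\mathcal{A}$ be a Suslin space, let $\Phi : \mathcal{A} \to \mathbb{R}$ be measurable, let $\mathcal{Q}$ be a separable and metrizable space, and let $\Psi : \mathcal{A} \to \mathcal{Q}$ be measurable. Let $\mathfrak{Q} \subseteq \mathcal{M}(\mathcal{Q})$ be such that $\operatorname{supp}(Q) \subseteq \Psi(\mathcal{A})$ for all $Q \in \mathfrak{Q}$. Let the data space $\mathcal{D}$ be metrizable and consider $B \in \mathcal{B}(\mathcal{D})$. Let $\mathbb{D} \in \mathfrak{D}$ be such that all the level sets of $\Psi$ go to zero, in the sense that

\[
\inf_{(f,\mu)\in\Psi^{-1}(q)} \mathbb{D}(f,\mu)[B] = 0, \quad \text{for all } q \in \Psi(\mathcal{A}). \tag{6.1}
\]

Then for any positive measurable section $\psi$ of $\Psi$, positive in the sense that

\[
\mathbb{D}(\psi(q))[B] > 0, \quad \text{for all } q \in \Psi(\mathcal{A}), \tag{6.2}
\]

it follows that

\[
\mathcal{U}\big(\Psi^{-1}(\mathfrak{Q}) \odot_B \mathfrak{D}\big) \ \ge\ \mathfrak{Q}^\infty(\Phi\circ\psi), \tag{6.3}
\]

where $\mathfrak{Q}^\infty(\Phi\circ\psi)$ is the essential supremum

\[
\mathfrak{Q}^\infty(\Phi\circ\psi) := \sup_{Q\in\mathfrak{Q}} \inf\big\{ r \in \mathbb{R} : Q[\Phi\circ\psi > r] = 0 \big\}. \tag{6.4}
\]

See Figure 6.1 for an illustration of Theorem 6.1.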

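We now use Theorem 6.1 to develop a brittleness theorem for a Bayesian model with local misspecification. To that end, let $\mathcal{X}$ be a Polish space so that, by [ I, Thm.15.15], $\mathcal{M}(\mathcal{X})$ endowed with the weak-* topology is Polish. Moreover, by [ 1^, Thm. 11.3.3], we know that if we select a complete consistent metric $d$ for $\mathcal{X}$, then the Prokhorov metric $d_{\mathcal{M}}$ defined by

\[
d_{\mathcal{M}}(\mu_1,\mu_2) := \inf\big\{ \epsilon > 0 \,\big|\, \mu_1(A) \le \mu_2(A^\epsilon) + \epsilon \ \text{for all } A \in \mathcal{B}(\mathcal{X}) \big\},
\]

where

\[
A^\epsilon := \big\{ x \in \mathcal{X} \,\big|\, d(x,x') < \epsilon \ \text{for some } x' \in A \big\}
\]

is the $\epsilon$-neighborhood of $A$, metrizes the weak-* topology on $\mathcal{M}(\mathcal{X})$. Moreover, Prokhorov's Theorem [48, Cor. 11.5.5] asserts that the Prokhorov metric $d_{\mathcal{M}}$ is a complete metric for the Polish space $\mathcal{M}(\mathcal{X})$. For $\alpha > 0$ and $\mu \in \mathcal{M}(\mathcal{X})$, let $B_\alpha(\mu) := \{\mu' \in \mathcal{M}(\mathcal{X}) \mid d_{\mathcal{M}}(\mu,\mu') < \alpha\}$ be the open ball of Prokhorov radius $\alpha$ about $\mu$.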
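To get a feel for how permissive a Prokhorov ball is, the definition above can be evaluated by brute force for finitely supported measures on the real line. The sketch below is an illustration only: it assumes $\mathcal{X} = \mathbb{R}$ with $d(x,x') = |x-x'|$, represents a measure as a dictionary `{point: mass}`, enumerates all subsets of the support of the first measure (so it is exponential in the support size), and locates the infimum by bisection. The function name `prokhorov` is hypothetical.

```python
from itertools import combinations


def prokhorov(mu1, mu2, tol=1e-9):
    """Approximate Prokhorov distance between finitely supported measures on R.

    It suffices to test subsets A of the support of mu1, since enlarging A
    outside that support does not increase mu1(A) but can only enlarge A^eps.
    """
    pts = [x for x, w in mu1.items() if w > 0]
    subsets = [c for r in range(1, len(pts) + 1) for c in combinations(pts, r)]

    def feasible(eps):
        for A in subsets:
            m1 = sum(mu1[x] for x in A)
            # mass that mu2 gives to the open eps-neighbourhood of A
            m2 = sum(w for y, w in mu2.items() if any(abs(y - x) < eps for x in A))
            if m1 > m2 + eps + 1e-12:
                return False
        return True

    lo, hi = 0.0, 1.0          # eps = 1 is always feasible because mu1(A) <= 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    print(prokhorov({0.0: 1.0}, {0.3: 1.0}))             # shifting all mass by 0.3
    print(prokhorov({0.0: 1.0}, {0.0: 0.9, 5.0: 0.1}))   # moving 10% of the mass far away
```

The second call shows the feature exploited in this section: moving a fraction $\alpha$ of the mass arbitrarily far costs only about $\alpha$ in Prokhorov distance.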

Let $\Theta$ be a Polish space and let the Bayesian model define a map

\[
\mathcal{P} : \Theta \to \mathcal{M}(\mathcal{X}).
\]

As in Section 2, the image $\mathcal{P}(\Theta)$ is referred to as the (Bayesian) model class.

Remark 6.2. When $\mathcal{P}$ is continuous, it follows from the definition [6, Sec. 3.2] of an analytic set that the image $\mathcal{P}(\Theta) \subseteq \mathcal{M}(\mathcal{X})$ is analytic, and since the range space $\mathcal{M}(\mathcal{X})$ is Polish it follows that $\mathcal{P}(\Theta)$ is Suslin. Actually, continuity is not required, since [6, Thm. 3.3.4] implies that if $\mathcal{P}$ is measurable, then the image $\mathcal{P}(\Theta)$ is Suslin. If, in addition, $\mathcal{P}$ is injective, then Suslin's Theorem [6, Thm. 3.2.3] implies that $\mathcal{P}(\Theta)$ is Borel.

Assume that $\mathcal{P}$ is measurable and denote its image by $\mathcal{A}_0 := \mathcal{P}(\Theta)$. Since $\mathcal{A}_0 \subseteq \mathcal{M}(\mathcal{X})$ is Suslin, it follows from Lemma 7.2 that the push-forward operator $\mathcal{P} : \mathcal{M}(\Theta) \to \mathcal{M}(\mathcal{A}_0)$ is affine continuous. Let $\pi_\Theta \in \mathcal{M}(\Theta)$ be a prior distribution on $\Theta$ and let $\pi_0 := \mathcal{P}\pi_\Theta \in \mathcal{M}(\mathcal{A}_0)$ be its push-forward.

Let $\Phi_0 : \mathcal{M}(\mathcal{X}) \to \mathbb{R}$ be a measurable quantity of interest. We are interested in estimating $\Phi_0$ using the prior $\pi_0$, and our purpose is to show the extreme brittleness of this estimation under arbitrarily small perturbations of the model class $\mathcal{A}_0$ in both the Prokhorov and total variation metrics.

For conditioning on observations, let the data space be $\mathcal{D} := \mathcal{X}^n$, and consider the $n$-i.i.d. sample data map $\mathbb{D}_0^n : \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathcal{X}^n)$ defined by

\[
\mathbb{D}_0^n \mu := \mu^n, \qquad \mu \in \mathcal{M}(\mathcal{X}). \tag{6.5}
\]

For $x^n = (x_1, \dots, x_n) \in \mathcal{X}^n$, dropping the notational dependence, denote the rectangle about $x^n$ by

\[
B_\delta^n := \prod_{i=1}^n B_\delta(x_i), \tag{6.6}
\]

where $B_\delta(x_i)$ is the open ball of radius $\delta$ about $x_i$.

Figure 6.2: Illustration of $\mathcal{A}_0$, $\mathcal{A}_\alpha$ and $(\mu_1,\mu_2) \in \mathcal{A}$.

Observe that the prior value of $\Phi_0$ under $\pi_0$ is $\mathbb{E}_{\pi_0}[\Phi_0]$ and its posterior value under the observation $d \in B_\delta^n$ is $\mathbb{E}_{\pi_0\odot\mathbb{D}_0^n}\big[\Phi_0 \,\big|\, B_\delta^n\big]$.

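To define $\alpha$-perturbations of the model class $\mathcal{A}_0$ in the Prokhorov metric, we introduce, for $\alpha > 0$, the $\alpha$-neighborhood $\mathcal{A}_\alpha \subseteq \mathcal{M}(\mathcal{X})$ of $\mathcal{A}_0$ defined by

\[
\mathcal{A}_\alpha := \bigcup_{\mu\in\mathcal{A}_0} B_\alpha(\mu). \tag{6.7}
\]

It is easy to see that the ball fibration (see Figure 6.2 and Remark 6.9)

\[
\mathcal{A} := \big\{ (\mu_1,\mu_2) \in \mathcal{M}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \,\big|\, \mu_1 \in \mathcal{A}_0,\ \mu_2 \in B_\alpha(\mu_1) \big\} \tag{6.8}
\]

of the set of balls about points of $\mathcal{A}_0$ projects to

\[
P_0 \mathcal{A} = \mathcal{A}_0, \tag{6.9}
\]
\[
P_\alpha \mathcal{A} = \mathcal{A}_\alpha, \tag{6.10}
\]

where $P_0 : \mathcal{M}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathcal{X})$ is the projection onto the first component and $P_\alpha$ the projection onto the second. The naturally induced set of priors corresponding to $\pi_0 \in \mathcal{M}(\mathcal{A}_0)$ is therefore the set $\Pi_\alpha \subseteq \mathcal{M}(\mathcal{A}_\alpha)$ defined by

\[
\Pi_\alpha := \big\{ \pi_\alpha \in \mathcal{M}(\mathcal{A}_\alpha) \,\big|\, \exists\, \pi \in \mathcal{M}(\mathcal{A}) \ \text{with } P_0\pi = \pi_0 \ \text{and } P_\alpha\pi = \pi_\alpha \big\}. \tag{6.11}
\]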

Remark 6.3. Observe that each element $\pi_\alpha \in \Pi_\alpha$ is the distribution of a random measure $\mu_2$ on $\mathcal{A}_\alpha$ such that: (i) there exists a random measure $\mu_1 \in \mathcal{A}_0$ with distribution $\pi_0$ (that of the Bayesian model); (ii) $(\mu_1,\mu_2)$ is jointly measurable; (iii) with probability one the Prokhorov distance from $\mu_2$ to $\mu_1$ is less than $\alpha$, i.e. $d_{\mathcal{M}}(\mu_1,\mu_2) < \alpha$. Observe in particular that $\pi_0 \in \Pi_\alpha$.

Our main result is provided in Theorem 6.10 but for the sake of clarity we will first give this result in the following (simpler) form.

Theorem 6.4. Using the notations introduced above, let $\Pi_\alpha$ be defined as in (6.11). If

\[
\lim_{\delta\downarrow 0}\ \sup_{x\in\mathcal{X}}\ \sup_{\theta\in\Theta}\ \mathcal{P}(\theta)[B_\delta(x)] = 0, \tag{6.12}
\]

then, for all $\alpha > 0$ there exists $\delta_c(\alpha) > 0$ such that for all $0 < \delta < \delta_c(\alpha)$ and all integers $n \ge 1$,

\[
\mathcal{U}\big(\Pi_\alpha \odot_{B_\delta^n} \mathbb{D}_0^n\big) \ \ge\ \operatorname{ess\,sup}_{\pi_0}(\Phi_0),
\]

where

\[
\operatorname{ess\,sup}_{\pi_0}(\Phi_0) := \inf\big\{ r > 0 \,\big|\, \pi_0[\Phi_0 > r] = 0 \big\},
\]

and with similar expressions for the lower bounds $\mathcal{L}$.

Remark 6.5. Theorem 6.4 implies the extreme brittleness of Bayesian inference under local misspecification. Indeed, assume that the model class $\mathcal{A}_0$ is well specified (i.e. it contains the truth $\mu^\dagger$) and that, therefore, the Bayesian estimator described by $\pi_0$ is consistent. One may believe that a model $\mathcal{A}_1$ lying in a "small enough" neighborhood of $\mathcal{A}_0$ should have good convergence properties. Theorem 6.4 and Remark 6.3 invalidate this belief. Using the notations of Remark 6.3, observe in particular that an unscrupulous practitioner may design a model corresponding to a random measure $\mu_2$ such that the distance between $\mu_1$ (the well specified model) and $\mu_2$ is a.s. at most $\alpha$ (where $\alpha$ is arbitrarily small) and the posterior value using the random measure $\mu_2$ is as distant as possible from the posterior value using $\mu_1$, irrespective of the sample size $n$.
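The following toy computation illustrates how such a perturbation might look in a finite setting. It is a sketch consistent with the statement of Theorem 6.4, not a reproduction of its proof, and every concrete choice is an assumption made for illustration: a Gaussian location family $\mu_\theta = N(\theta,1)$, a uniform prior on a five-point grid, the quantity of interest $\Phi_0(\mu_\theta) = \theta$, and the data values. For every $\theta$ except the largest, the perturbed measure moves the mass that $\mu_\theta$ gives to the single ball $B_\delta(x_1)$ elsewhere, which makes the likelihood of the rectangle $B_\delta^n$ vanish. The mass moved bounds the total variation distance, which in turn bounds the Prokhorov distance.

```python
import math


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ball_mass(theta, x, delta):
    """mu_theta[B_delta(x)] for mu_theta = N(theta, 1) (illustrative model)."""
    return normal_cdf(x + delta - theta) - normal_cdf(x - delta - theta)


def posterior_value(thetas, prior, likelihoods):
    """Posterior mean of Phi_0(mu_theta) = theta given the rectangle event."""
    num = sum(p * t * l for p, t, l in zip(prior, thetas, likelihoods))
    den = sum(p * l for p, l in zip(prior, likelihoods))
    return num / den


thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]      # hypothetical parameter grid
prior = [0.2] * len(thetas)               # uniform prior on the grid
data = [0.1, -0.3, 0.2, 0.05]             # hypothetical observations
delta = 0.01                              # radius of the balls around the data

# Unperturbed model: likelihood of the rectangle is a product of ball masses.
honest = [math.prod(ball_mass(t, x, delta) for x in data) for t in thetas]

# Perturbed model: for all theta but the largest, empty the ball around x_1.
moved = [ball_mass(t, data[0], delta) for t in thetas[:-1]]
perturbed = [0.0] * (len(thetas) - 1) + [honest[-1]]

honest_value = posterior_value(thetas, prior, honest)
perturbed_value = posterior_value(thetas, prior, perturbed)
perturbation_size = max(moved)            # largest mass moved for any theta

if __name__ == "__main__":
    print(honest_value, perturbed_value, perturbation_size)
```

In this toy setting the perturbed posterior sits on the largest grid value regardless of how many observations are used, because emptying one ball is enough to zero the whole product.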

Remark 6.6. Observe that the condition (6.12) is extremely weak and satisfied for most Bayesian models. This condition can in fact be made weaker by replacing it with the assumption that for $n$ sufficiently large it holds true that for all $\theta$, $\mathcal{P}(\theta)$ does not contain a Dirac mass in each ball $B_\delta(x_i)$ (i.e. on the sample data when $\delta \downarrow 0$). We also note that the proof of Theorem 6.4 does not require the samples to be i.i.d. In particular, the same results can be obtained with coupled samples if, for instance, the data map $\mathbb{D}_0^n$ is replaced by a data map $\mathbb{D}$ such that $C_1 \prod_{i=1}^n \mu(A_i) \le \mathbb{D}(\mu)[A_1 \times \cdots \times A_n] \le C_2 \prod_{i=1}^n \mu(A_i)$ for strictly positive constants $C_1$ and $C_2$.

Remark 6.7. Theorem 6.4 is a corollary of Theorem 6.10 and the proof of Theorem 6.10 shows that, if $\Theta$ is compact, $\mathcal{P}$ is continuous and $\Phi(\mu) := \mu(A)$ for some fixed $A \in \mathcal{B}(\mathcal{X})$, then (see Remark 6.13) the result of Theorem 6.4 holds when using the total variation distance $d_{TV}$ instead of the Prokhorov distance, which produces a much smaller neighborhood.

Remark 6.8. Theorems 5.12 and 6.4 are a posteriori brittleness estimates, i.e. posterior to the observation of the data. Note that under the (weak) condition (6.12) the conclusion of Theorem 6.4 holds uniformly, irrespective of the size and value of the data. We will show in a sequel work that the brittleness of Bayesian inference is even stronger with respect to a priori statistical estimation error estimates (i.e., after averaging with respect to the data generating distribution) because of the possible singularity of that distribution with respect to the model under misspecification.

We will now give a more general version of Theorem 6.4 and elaborate on the objects 
entering in its formulation. 

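We start with $\Pi_\Theta \subseteq \mathcal{M}(\Theta)$, a set of admissible priors, and let

\[
\Pi_0 := \mathcal{P}\Pi_\Theta \subseteq \mathcal{M}(\mathcal{A}_0)
\]

denote the push-forward by the model $\mathcal{P}$.

We consider the pull-back $\Phi_\Theta := \Phi_0 \circ \mathcal{P}$ of the measurable quantity of interest $\Phi_0 : \mathcal{M}(\mathcal{X}) \to \mathbb{R}$ to a measurable quantity of interest $\Phi_\Theta : \Theta \to \mathbb{R}$. Then the change of variables formula [ , Thm. 4.1.11] implies that, for $\pi_\Theta \in \mathcal{M}(\Theta)$,

\[
\mathbb{E}_{\pi_\Theta}[\Phi_\Theta] = \mathbb{E}_{\pi_\Theta}[\Phi_0 \circ \mathcal{P}] = \mathbb{E}_{\mathcal{P}\pi_\Theta}[\Phi_0]
\]

whenever either side is well defined. Therefore, taking supremums and infimums, we obtain

\[
\mathcal{U}(\Pi_\Theta) = \mathcal{U}(\Pi_0), \qquad \mathcal{L}(\Pi_\Theta) = \mathcal{L}(\Pi_0),
\]

where we note that the quantity of interest implicit in these definitions is determined by the argument. For $\alpha > 0$, define $\mathcal{A}_\alpha$, $\mathcal{A}$, $P_0$ and $P_\alpha$ as in (6.7), (6.8), (6.9) and (6.10).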

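Remark 6.9. Using the affine convexity of $\mathcal{M}(\mathcal{X})$, one can show that $\mathcal{A}$ is indeed a Hurewicz fibration, in that it has the homotopy lifting property, see e.g. [99, Pg. 66]. Since $d_{\mathcal{M}} : \mathcal{M}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathbb{R}$ is continuous, it follows that $d_{\mathcal{M}}(<\alpha) := \{(\mu_1,\mu_2) \mid d_{\mathcal{M}}(\mu_1,\mu_2) < \alpha\}$ is open and therefore Borel. In addition, since $\mathcal{A}_0 \subseteq \mathcal{M}(\mathcal{X})$ is Suslin it follows that $\mathcal{A}_0 \times \mathcal{M}(\mathcal{X}) \subseteq \mathcal{M}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ is Suslin. Therefore, since $\mathcal{A} = d_{\mathcal{M}}(<\alpha) \cap (\mathcal{A}_0 \times \mathcal{M}(\mathcal{X}))$, it follows that $\mathcal{A}$ is Suslin.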

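Observe that the measurable quantity of interest $\Phi_0 : \mathcal{M}(\mathcal{X}) \to \mathbb{R}$, acting on the second component of $\mathcal{M}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$, naturally pulls back to the quantity of interest $\Phi : \mathcal{M}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathbb{R}$ by $\Phi := \Phi_0 \circ P_\alpha$, and we have $\sup_{\mathcal{A}_\alpha} \Phi_0 = \sup_{\mathcal{A}} \Phi$ and $\inf_{\mathcal{A}_\alpha} \Phi_0 = \inf_{\mathcal{A}} \Phi$, i.e.

\[
\mathcal{U}(\mathcal{A}_\alpha) = \mathcal{U}(\mathcal{A}), \qquad \mathcal{L}(\mathcal{A}_\alpha) = \mathcal{L}(\mathcal{A}).
\]

For a subset $\Pi_0 \subseteq \mathcal{M}(\mathcal{A}_0)$, the projection identity (6.9) implies that the set $\Pi := P_0^{-1}\Pi_0$ defined by $P_0^{-1}\Pi_0 := \{\pi \in \mathcal{M}(\mathcal{A}) \mid P_0\pi \in \Pi_0\}$ is the induced set of probability measures on $\mathcal{A}$. Moreover, for $\pi \in \Pi$, the change of variables formula

\[
\mathbb{E}_\pi[\Phi] = \mathbb{E}_\pi[\Phi_0 \circ P_\alpha] = \mathbb{E}_{P_\alpha\pi}[\Phi_0]
\]

implies that

\[
\sup_{\pi\in\Pi} \mathbb{E}_\pi[\Phi] = \sup_{\pi_\alpha\in P_\alpha\Pi} \mathbb{E}_{\pi_\alpha}[\Phi_0], \qquad
\inf_{\pi\in\Pi} \mathbb{E}_\pi[\Phi] = \inf_{\pi_\alpha\in P_\alpha\Pi} \mathbb{E}_{\pi_\alpha}[\Phi_0],
\]

so that

\[
P_\alpha\Pi = P_\alpha P_0^{-1}\Pi_0 \subseteq \mathcal{M}(\mathcal{A}_\alpha)
\]

is the induced set of probability measures on $\mathcal{A}_\alpha$. Let us denote this induced set by

\[
\Pi_\alpha := P_\alpha P_0^{-1}\Pi_0 \tag{6.13}
\]

so that these equalities become

\[
\mathcal{U}(\Pi) = \mathcal{U}(\Pi_\alpha), \qquad \mathcal{L}(\Pi) = \mathcal{L}(\Pi_\alpha).
\]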

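For conditioning on observations, define $\mathbb{D}_0^n$ as in (6.5) and pull it back to the data map $\mathbb{D}^n : \mathcal{M}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathcal{X}^n)$ defined by $\mathbb{D}^n := \mathbb{D}_0^n \circ P_\alpha$. Define $B_\delta^n$ as in (6.6) and recall the definition (5.2)

\[
\mathbb{E}_{\pi\odot\mathbb{D}^n}\big[\Phi \,\big|\, B_\delta^n\big] := \frac{\mathbb{E}_{(\mu_1,\mu_2)\sim\pi}\big[\Phi(\mu_1,\mu_2)\,\mathbb{D}^n(\mu_1,\mu_2)[B_\delta^n]\big]}{\mathbb{E}_{(\mu_1,\mu_2)\sim\pi}\big[\mathbb{D}^n(\mu_1,\mu_2)[B_\delta^n]\big]}
\]

of the conditional expectation and the corresponding (5.6) upper value

\[
\mathcal{U}\big(\Pi \odot_{B_\delta^n} \mathbb{D}^n\big) := \sup_{\pi\odot\mathbb{D}^n \,\in\, \Pi\odot_{B_\delta^n}\mathbb{D}^n} \mathbb{E}_{\pi\odot\mathbb{D}^n}\big[\Phi \,\big|\, B_\delta^n\big]
\]

in terms of the admissible set (5.4)

\[
\Pi \odot_{B_\delta^n} \mathbb{D}^n := \big\{ \pi \odot \mathbb{D}^n : \pi \in \Pi,\ (\pi\cdot\mathbb{D}^n)[B_\delta^n] > 0 \big\}
\]

of product measures, where the marginal is defined by

\[
(\pi\cdot\mathbb{D}^n)[B_\delta^n] := \mathbb{E}_{(\mu_1,\mu_2)\sim\pi}\big[\mathbb{D}^n(\mu_1,\mu_2)[B_\delta^n]\big].
\]

Let us indicate the dependence on some measure $\pi$ of the essential supremum of some quantity of interest $\Phi$ by

\[
\pi^\infty(\Phi) := \inf\big\{ r \in \mathbb{R} \,\big|\, \pi\{\Phi > r\} = 0 \big\}
\]

and for a set $\Pi$ of measures

\[
\Pi^\infty(\Phi) := \sup_{\pi\in\Pi} \pi^\infty(\Phi). \tag{6.14}
\]

For $\pi_\alpha = P_\alpha\pi$ with $\pi \in \Pi$, we have

\[
\pi_\alpha[\Phi_0 > r] = (P_\alpha\pi)[\Phi_0 > r] = \pi[\Phi_0 \circ P_\alpha > r] = \pi[\Phi > r],
\]

so that we conclude that

\[
\Pi^\infty(\Phi) = \Pi_\alpha^\infty(\Phi_0).
\]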

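Let us now quantify a type of regularity for the model $\mathcal{P}$. For $x \in \mathcal{X}$, let $B_0(x) := \{x\}$ and define

\[
\mathcal{P}_\infty(\delta) := \sup_{x\in\mathcal{X}}\ \sup_{\theta\in\Theta}\ \mathcal{P}(\theta)[B_\delta(x)], \quad \text{for } \delta \ge 0.
\]

It is clear that $\mathcal{P}_\infty : \mathbb{R}_+ \to [0,1]$ is an increasing function. Moreover, for most parametric families, it is easy to show that $\mathcal{P}_\infty$ is continuous and $\mathcal{P}_\infty(0) = 0$, and for many of them it is not difficult to find useful upper bounds.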
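As a concrete instance, consider the Gaussian location family $\mathcal{P}(\theta) = N(\theta,\sigma^2)$ on $\mathbb{R}$ with $\theta \in \mathbb{R}$; this family is an assumption chosen for illustration and does not appear in the text. By symmetry and unimodality the supremum over $x$ and $\theta$ is attained at a ball centred on the mean, so $\mathcal{P}_\infty(\delta) = \operatorname{erf}\big(\delta/(\sigma\sqrt{2})\big)$. The sketch below evaluates this and finds, by bisection, a radius below which $\mathcal{P}_\infty(\delta) < \alpha$, which is the smallness condition appearing in Theorem 6.10. The helper names are hypothetical.

```python
import math


def p_inf_gaussian(delta, sigma=1.0):
    """sup_x sup_theta N(theta, sigma^2)[B_delta(x)]: mass of (-delta, delta) under N(0, sigma^2)."""
    return math.erf(delta / (sigma * math.sqrt(2.0)))


def critical_delta(alpha, sigma=1.0, tol=1e-12):
    """Largest delta (up to tol) with p_inf_gaussian(delta) < alpha, for 0 < alpha < 1."""
    lo, hi = 0.0, 1.0
    while p_inf_gaussian(hi, sigma) < alpha:   # grow the bracket until hi is infeasible
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if p_inf_gaussian(mid, sigma) < alpha:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    for alpha in (0.01, 0.05, 0.2):
        print(alpha, critical_delta(alpha))
```

For small $\alpha$ the admissible radius is roughly $\alpha\,\sigma\sqrt{\pi/2}$, so the condition is met by simply observing the data to sufficient resolution.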

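Finally, let us assume that the model $\mathcal{P}$ is positive, in that $\mu(B_\delta(x)) > 0$ for all $\mu \in \mathcal{A}_0$, $x \in \mathcal{X}$, and $\delta > 0$.

Theorem 6.4 is a direct consequence of the following theorem.

Theorem 6.10 (Extreme Brittleness under Local Misspecification). With the notation and assumptions above, let $\Pi_\alpha$ be defined as in (6.13), and let $\delta > 0$ and $0 < \alpha < 1$ satisfy

\[
\mathcal{P}_\infty(\delta) < \alpha.
\]

Then, for all integers $n \ge 1$,

\[
\mathcal{U}\big(\Pi_\alpha \odot_{B_\delta^n} \mathbb{D}_0^n\big) \ \ge\ \Pi_0^\infty(\Phi_0),
\]

with similar expressions for the lower bounds $\mathcal{L}$.

Remark 6.11. When Cromwell's rule (see Section 2) is implemented (i.e. if the prior measure of every non-empty neighborhood is strictly positive), it follows that $\Pi_0^\infty(\Phi_0) = \mathcal{U}(\mathcal{A}_0)$, so that the conclusion of Theorem 6.10 becomes

\[
\mathcal{U}\big(\Pi_\alpha \odot_{B_\delta^n} \mathbb{D}_0^n\big) \ \ge\ \mathcal{U}(\mathcal{A}_0).
\]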

Remark 6.12. Theorem 6.10 provides conditions sufficient to guarantee how bad things 
can get regardless of how many samples are taken. One might hope that when these 
conditions are not satisfied, that more samples may prove beneficial. However, when the 
condition 

\[
\inf_{(\mu,\mu')\in\Psi^{-1}\mu} \mathbb{D}^n(\mu,\mu')[B_\delta^n] = 0, \qquad \mu \in \mathcal{A}_0,
\]

of Theorem 6.1 is only approximately satisfied, the inequality 

n 

and the quantitative version of Theorem 5.12 (given in [ , Thm. 3.1], see also [85, Rmk. 3.2]) imply that things actually get worse with more samples.

Remark 6.13. The proof of Theorem 6.10 shows that we can obtain a similar result when using the total variation distance $d_{TV}$ instead of the Prokhorov distance, which produces a much smaller neighborhood. However, in this metric $\mathcal{M}(\mathcal{X})$ is in general not separable, and this introduces measurability difficulties. These difficulties can be overcome somewhat when $\Theta$ is compact and $\mathcal{P}$ is continuous, since the image of a compact set under a continuous map is compact and therefore measurable. Moreover, validation or certification type quantities of interest defined by $\Phi(\mu) := \mu(A)$ for some fixed $A \in \mathcal{B}(\mathcal{X})$ are easily seen to be continuous and therefore measurable. Moreover, because of continuity,

\[
\Pi_0^\infty(\Phi_0) \approx \Pi_\alpha^\infty(\Phi_0).
\]

7 Admissible Sets as Measurable Spaces 

In the Kolmogorov formulation of probability, to put probability measures on an admissible set $\mathcal{A} \subseteq \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ requires that the admissible set be a measurable space, i.e. $\mathcal{A}$ must be equipped with a $\sigma$-algebra of subsets upon which measures can be defined. This section concerns the development of such measurable structures. We will first describe a simple measurable structure corresponding to a non-separable complete topological space. However, the non-separability appears to make much of our analysis difficult, and so we also develop measurable structures which come from Polish (completely metrizable separable) spaces. These appear to give us what we need: not only the ability to apply the reduction theorems of Owhadi et al. [' ' ], but also what is required to satisfy the technical needs of developing the Bayesian OUQ framework. See also the discussion of the benefits of Polish spaces in Remark 3.2.

To begin, let $\mathcal{X}$ be a metrizable topological space, let $\mathcal{F}(\mathcal{X})$ be a set of real-valued functions on $\mathcal{X}$, and consider the space $\mathcal{M}(\mathcal{X})$ of Borel probability measures on $\mathcal{X}$ equipped with the weak-* topology and the corresponding Borel $\sigma$-algebra $\mathcal{B}(\mathcal{M}(\mathcal{X}))$. To put a probability measure on a subset

\[
\mathcal{A} \subseteq \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})
\]

requires defining a $\sigma$-algebra of subsets of $\mathcal{A}$. To do this in a way that is robust to the selection of the particular subset $\mathcal{A}$ it is sufficient and, depending on the nature of the set of permissible assumption sets, possibly necessary to specify a $\sigma$-algebra on $\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ and induce a $\sigma$-algebra on a specified subset $\mathcal{A} \subseteq \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ through relativization. Let us first consider generalized moment constraints. By [ ,, Thm. 15.13], for a separable and metrizable space $\mathcal{X}$, and for any bounded measurable function $g : \mathcal{X} \to \mathbb{R}$, it follows that the map $\mathcal{M}(\mathcal{X}) \to \mathbb{R}$ defined by $\mu \mapsto \int g \, d\mu$ is measurable. Therefore, arbitrary bounded generalized moment constraints are in general measurable.

However, the objective function for many OUQ problems is not quite so simple. Often it is of the form

\[
\Phi_A(f,\mu) = (f_*\mu)(A), \quad \text{for } (f,\mu) \in \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}),
\]

for some fixed measurable subset $A \subseteq \mathbb{R}$. Therefore, we can deduce a $\sigma$-algebra on $\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ by pulling back the $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ to $\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ using the map $(f,\mu) \mapsto (f_*\mu)(A)$ and also pulling back under each of the maps $(f,\mu) \mapsto \mathbb{E}_\mu[g]$ for $g \in \mathcal{G}$, where $\mathcal{G}$ is the constraint set. However useful such a method might be for a particular problem, it is appealing to describe measurable structures on $\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ for which a large class of objective functions $\Phi$, in particular $\Phi_A(f,\mu) = (f_*\mu)(A)$ for various $A$, and constraints, would be generally measurable.

To that end, we first demonstrate that by restricting $\mathcal{F}(\mathcal{X})$ to the bounded measurable functions $B(\mathcal{X})$ equipped with the supremum norm, the product structure for $B(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ makes the objective functions $\Phi_A$, $A \in \mathcal{B}(\mathbb{R})$, measurable and generalized moment constraints measurable. However, although the topological space $B(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ is complete and metrizable, it is, in general, not separable. We then describe a general procedure for choosing smaller, but still very large, sets of functions $\mathcal{F}(\mathcal{X})$ in such a way that $\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ is Polish and such that all objective functions $\Phi_A(f,\mu) = (f_*\mu)(A)$ for $A \in \mathcal{B}(\mathbb{R})$ and all generalized moment constraints are measurable. To that end, let us describe some notation. For a topological space $Z$, let $B(Z)$ denote the Banach space of bounded measurable real valued functions on $Z$ with the uniform (supremum) norm $\|\cdot\|_\infty$ and let $\mathcal{B}(B(Z))$ denote the $\sigma$-algebra of subsets of $B(Z)$ corresponding to the metric topology of $B(Z)$. When $Z$ is metric, let $BL(Z)$ denote the space of bounded, Lipschitz, measurable real valued functions on $Z$ with norm $\|\cdot\|_{BL} := \|\cdot\|_\infty + \|\cdot\|_{\mathrm{Lip}}$.

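For a set $\mathcal{F}(\mathcal{X})$ of real valued measurable functions on $\mathcal{X}$ equipped with a topology $\tau(\mathcal{F})$ we consider the map

\[
J : \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathbb{R}) \tag{7.1}
\]

defined by

\[
J(f,\mu) := f_*\mu, \quad \text{for } (f,\mu) \in \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}). \tag{7.2}
\]

We will be interested in developing conditions on the topological space $(\mathcal{F}(\mathcal{X}), \tau(\mathcal{F}))$ which guarantee that $J$ is measurable, that is

\[
J : \big(\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}),\ \mathcal{B}(\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}))\big) \to \big(\mathcal{M}(\mathbb{R}),\ \mathcal{B}(\mathcal{M}(\mathbb{R}))\big). \tag{7.3}
\]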
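For finitely supported measures the map $J$ and the objective $\Phi_A$ can be written out directly, which may help fix the notation. The sketch below is only an illustration under that finiteness assumption; the dictionary representation `{point: mass}` and the function names are hypothetical. It also checks the change of variables identity $\mathbb{E}_{f_*\mu}[h] = \mathbb{E}_\mu[h\circ f]$ on one example.

```python
from collections import defaultdict


def pushforward(f, mu):
    """J(f, mu) = f_* mu for a finitely supported measure mu given as {point: mass}."""
    out = defaultdict(float)
    for x, w in mu.items():
        out[f(x)] += w          # mass at x is transported to f(x)
    return dict(out)


def phi_A(f, mu, indicator_A):
    """Phi_A(f, mu) = (f_* mu)(A), with the set A given by its indicator function."""
    return sum(w for y, w in pushforward(f, mu).items() if indicator_A(y))


def expect(g, mu):
    """E_mu[g] for a finitely supported measure mu."""
    return sum(g(x) * w for x, w in mu.items())


if __name__ == "__main__":
    mu = {-1.0: 0.25, 0.0: 0.5, 1.0: 0.25}
    square = lambda x: x * x
    print(pushforward(square, mu))                 # image measure of mu under x -> x^2
    print(phi_A(square, mu, lambda y: y >= 1.0))   # (f_* mu)([1, infinity))
```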

The goal of this section will be the proof of the following theorem: 

Theorem 7.1. Suppose that $\mathcal{X}$ is Polish and $\mathcal{M}(\mathcal{X})$ and $\mathcal{M}(\mathbb{R})$ are equipped with the weak-* topologies. When $\mathcal{X}$ is compact, consider $\mathcal{F}(\mathcal{X}) := C(\mathcal{X})$, the Banach space of continuous functions. Or more generally, let $\mathcal{F}(\mathcal{X})$ be a RKHS (Reproducing Kernel Hilbert Space) or RKBS (Reproducing Kernel Banach Space) of real functions with a measurable feature map, or the space $UC(\mathcal{X})$ of upper semicontinuous functions with the Wijsman topology obtained through the identification of an upper semicontinuous function with its hypograph. Then $\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ is Polish and $J : \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathbb{R})$ defined by $J(f,\mu) := f_*\mu$ is measurable.

7.1 Evaluation Measurable Function Spaces

In anticipation of proving measurability by proving that $J$ is a Caratheodory function (one that is continuous in one variable and measurable in the other [,, Def. 4.50]), the following lemma is useful and should have independent interest. Namely, [I, Thm. 15.14] states that if $f : \mathcal{X} \to \mathcal{Y}$ is continuous, then $f_* : \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathcal{Y})$ is continuous. We show, using a result of Kechris [71, Thm. 13.11, p. 84], that the continuity requirement on $f$ can be removed while still obtaining continuity of $f_*$. Continuity with respect to $\mu$ is a large step towards the measurability of $J$: all that then remains is the measurability with respect to $f$.

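Lemma 7.2. Let $(\mathcal{X}, \tau_{\mathcal{X}})$ be Polish and $(\mathcal{Y}, \tau_{\mathcal{Y}})$ metrizable and second countable, and let $\mathcal{M}(\mathcal{X})$ and $\mathcal{M}(\mathcal{Y})$ denote the corresponding spaces of Borel probability measures endowed with the weak-* topologies. Let $f : (\mathcal{X}, \mathcal{B}(\tau_{\mathcal{X}})) \to (\mathcal{Y}, \mathcal{B}(\tau_{\mathcal{Y}}))$ be measurable. Then $f_* : \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathcal{Y})$ is continuous.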

We are now in a position to state our first measurability result. 

Proposition 7.3. Let $\mathcal{X}$ be Polish and consider $\mathcal{M}(\mathcal{X})$ and $\mathcal{M}(\mathbb{R})$ endowed with the weak-* topologies. Consider $\mathcal{F}(\mathcal{X}) := B(\mathcal{X})$ endowed with the metric topology corresponding to $\|\cdot\|_\infty$. Then $B(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ is complete metrizable and the map $J$, defined in (7.1) and (7.2), is measurable.

Proposition 7.3 shows that $J$ is measurable with respect to the Borel structure of the product space $B(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$. However, although the product space $B(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ is complete and metrizable, it is in general not separable. Indeed, it appears that this space is so large that separate continuity of $J : B(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathbb{R})$ is available, suggesting that it might be possible to weaken this measurable structure by a judicious choice of topological function space $(\mathcal{F}(\mathcal{X}), \tau(\mathcal{F}))$ in such a way that makes $\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ into a Polish space and keeps $J$ measurable.

In Arens [ ] a topology on a space of functions is called admissible if point evaluation is jointly continuous in the product of the space of functions and the domain. To achieve the aforementioned goal of making $\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ Polish and keeping $J : \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathbb{R})$ measurable for a set $\mathcal{F}$ of measurable functions, we generalize Arens' notion from topological spaces to measurable spaces by requiring the identity map to be a normal integrand in the sense of Rockafellar and Wets [90, Def. 14.27]. Specifically,

Definition 7.4. A set $\mathcal{F}(\mathcal{X})$ of real valued measurable functions on a topological space $\mathcal{X}$ equipped with a $\sigma$-algebra $\sigma(\mathcal{F})$ of subsets of $\mathcal{F}(\mathcal{X})$ is called an evaluation measurable function space if the mapping $\iota_x : (\mathcal{F}(\mathcal{X}), \sigma(\mathcal{F})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ defined by

\[
\iota_x f := f(x), \quad \text{for } f \in \mathcal{F}(\mathcal{X}),
\]

is measurable for all $x \in \mathcal{X}$.

In many situations, it appears to be a minimal requirement that a measurable space $(\mathcal{F}(\mathcal{X}), \sigma(\mathcal{F}))$ be evaluation measurable. The following result shows that this is equivalent to the measurability of $J$ in the product structure:

Theorem 7.5. Suppose that $\mathcal{X}$ is Polish. Then $J : \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathbb{R})$ is measurable in the product structure $\sigma(\mathcal{F}) \times \mathcal{B}(\mathcal{M}(\mathcal{X}))$ if and only if $(\mathcal{F}(\mathcal{X}), \sigma(\mathcal{F}))$ is evaluation measurable.

In particular, since point evaluation on B{X) is continuous, it is measurable, so that 
Theorem 7.5 implies Proposition 7.3. 

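The following corollary says that composing the response function $f$ with a utility function $h$ maintains measurability:

Corollary 7.6. Given the assumptions of Theorem 7.5, suppose that $h : \mathbb{R} \to \mathbb{R}$ is measurable. Then the map $J_h : \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathbb{R})$ defined by

\[
J_h(f,\mu) := (h\circ f)_*\mu, \quad \text{for } (f,\mu) \in \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}),
\]

is measurable in the product structure.

Furthermore, when $(\mathcal{F}(\mathcal{X}), \tau(\mathcal{F}))$ is Polish, Theorem 7.5 and Corollary 7.6 provide measurability on the product space as follows:

Corollary 7.7. Suppose that $\mathcal{X}$ and $(\mathcal{F}(\mathcal{X}), \tau(\mathcal{F}))$ are Polish. Then $J : \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathbb{R})$ is measurable in the Borel structure $\mathcal{B}(\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}))$ of the product space if and only if $(\mathcal{F}(\mathcal{X}), \mathcal{B}(\tau(\mathcal{F})))$ is evaluation measurable.

Moreover, if $(\mathcal{F}(\mathcal{X}), \mathcal{B}(\tau(\mathcal{F})))$ is evaluation measurable, then for any measurable function $h : \mathbb{R} \to \mathbb{R}$, the map $J_h : \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \to \mathcal{M}(\mathbb{R})$ defined by

\[
J_h(f,\mu) := (h\circ f)_*\mu, \quad \text{for } (f,\mu) \in \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}),
\]

is measurable in the Borel structure of the product space.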

7.2 Polish Evaluation Measurable Function Spaces 

As a consequence of Corollary 7.7, it follows that to make $\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ a Polish space such that $J$ is measurable, it is necessary and sufficient that $\mathcal{F}(\mathcal{X})$ be a Polish evaluation measurable function space with respect to its Borel structure. Fortunately, such spaces have already been well studied in the literature.

The Banach space $C(\mathcal{X})$ of bounded continuous functions is, by [ , Lem. 3.99], separable when $\mathcal{X}$ is compact and metrizable. Since point evaluation is continuous in general, when $\mathcal{X}$ is compact metrizable $C(\mathcal{X})$ is then a Polish evaluation measurable function space. When $\mathcal{X}$ is not compact, this is not the case.

Reproducing Kernel Hilbert Spaces (RKHS), see e.g. [KM, Sec. 4] and [ ], are extremely important in Learning Theory. Unlike the Lebesgue space $L^2$, a RKHS $H$ is a Hilbert space of real valued functions (not equivalence classes) characterized by the fact that point evaluation is a continuous function on the Hilbert space. Consequently, any separable RKHS of functions on $\mathcal{X}$ is a Polish evaluation measurable function space. To obtain separability, [101, Lem. 4.33] asserts that if $\mathcal{X}$ is separable and the kernel $k$ corresponding to the RKHS $H$ is continuous, then $H$ is separable. More generally, Steinwart and Scovel [105, Cor. 3.6] show that if there exists a finite and strictly positive Borel measure on $\mathcal{X}$, then every bounded and separately continuous kernel $k$ has a separable RKHS $H$. Also, [23, Thm. 15, pg. 33] shows that a RKHS $H$ is separable if there is a countable subset $\mathcal{X}_0 \subseteq \mathcal{X}$ such that if $f \in H$ and $f(x) = 0$ for all $x \in \mathcal{X}_0$ then $f = 0$. Finally, a result of Fortet [ \ Thm. 1.2] asserts that a RKHS $H$ with kernel $k$ is separable if and only if for all $\epsilon > 0$ there exists a countable partition $B_j$, $j \in \mathbb{N}$, of $\mathcal{X}$ such that for all $j \in \mathbb{N}$ and all $x_1, x_2 \in B_j$ we have

\[
k(x_1,x_1) + k(x_2,x_2) - k(x_1,x_2) - k(x_2,x_1) < \epsilon.
\]
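Fortet's condition is easy to exhibit for a specific kernel. As an illustration only (the kernel, the domain $\mathcal{X} = \mathbb{R}$ and the helper names are assumptions, not taken from the text), take the Gaussian kernel $k(x,y) = \exp(-(x-y)^2/(2s^2))$. The left-hand side of the condition equals $2 - 2\exp(-(x_1-x_2)^2/(2s^2))$, so partitioning $\mathbb{R}$ into half-open intervals of width $w = s\sqrt{-2\ln(1-\epsilon/2)}$ gives a countable partition on which the condition holds.

```python
import math


def gaussian_kernel(x, y, s=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * s * s))


def fortet_gap(k, x1, x2):
    """k(x1,x1) + k(x2,x2) - k(x1,x2) - k(x2,x1), the squared feature-space distance."""
    return k(x1, x1) + k(x2, x2) - k(x1, x2) - k(x2, x1)


def cell_width(eps, s=1.0):
    """Width w such that |x1 - x2| < w implies fortet_gap < eps (requires 0 < eps < 2)."""
    return s * math.sqrt(-2.0 * math.log(1.0 - eps / 2.0))


def cell_index(x, w):
    """Index j of the cell [j*w, (j+1)*w) containing x; the cells form a countable partition of R."""
    return math.floor(x / w)


if __name__ == "__main__":
    eps = 0.1
    w = cell_width(eps)
    x1, x2 = 3.1 * w, 3.9 * w          # two points in the same cell
    print(cell_index(x1, w) == cell_index(x2, w), fortet_gap(gaussian_kernel, x1, x2))
```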

A RKHS is usually implicitly defined by its kernel and often it is desirable to have 
a more concrete representation of the corresponding RKHS H. Although important 
RKHSs such as the Fock space described by Bargmann [0] have been known for some 
time, it is only recently that Steinwart, Hush and Scovel [ ' ] provided an explicit de- 
scription of Gaussian RKHSs. However, even without an explicit representation, often 
we can say something about how expressive the RKHS is in terms of approximation prop- 
erties. Steinwart [lOU] introduced universal kernels on compact domains as those whose 
RKHS can approximate any continuous function uniformly, and demonstrated that many 
of the existing popular kernels, in particular the Gaussian RKHSs, are universal. For 
noncompact X, Steinwart, Hush and Scovel [103] provide conditions on the kernel which 
guarantee approximation properties in L^ spaces. Most important however, is that the 
expressive capability of the Gaussian RKHSs is part of what allowed Steinwart and 
Scovel [104] to prove that support vector machines learn fast. For a thorough discussion 
of these topics, see [101]. Although the approximation properties investigated so far are with respect to Learning Theory, they suggest that approximation properties with respect to, for example, $J(f,\mu) = f_*\mu$ might be available.

Reproducing Kernel Banach Spaces (RKBSs), introduced by Zhang, Xu, and Zhang
[122], are Banach spaces of real-valued functions for which point evaluation is continuous.
Therefore, any separable Reproducing Kernel Banach Space is a linear Polish evaluation
measurable function space. An "if and only if" characterization of separability is
obtained through a generalization of Fortet's Theorem from RKHSs to RKBSs. We suspect
our proof is similar to Fortet's for RKHSs, but it is not written down in [ ]. Indeed,
Fortet's result mentioned above is a regularity condition on the pullback metric

$$d_H(x_1,x_2) := \|\Phi(x_1) - \Phi(x_2)\|_{H_1} = \sqrt{k(x_1,x_1) + k(x_2,x_2) - k(x_1,x_2) - k(x_2,x_1)}$$

to $X$ determined by a feature map $\Phi\colon X \to H_1$. In particular, Fortet's condition then
becomes: for all $j \in \mathbb{N}$ and all $x_1, x_2 \in B_j$ we have

$$d_H(x_1,x_2) < \sqrt{\epsilon}.$$
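
As an illustration of the pullback metric, the following minimal sketch evaluates $d_H$ directly from a kernel, without constructing a feature map. The function names and the two example kernels are assumptions made for this illustration and do not come from the paper; for the linear kernel the feature map is the identity, so $d_H$ reduces to the Euclidean distance.

```python
import math

def pullback_metric(k, x1, x2):
    """d_H(x1, x2) = sqrt(k(x1,x1) + k(x2,x2) - k(x1,x2) - k(x2,x1))."""
    s = k(x1, x1) + k(x2, x2) - k(x1, x2) - k(x2, x1)
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(s, 0.0))

def linear_kernel(x, y):
    """Illustrative kernel k(x, y) = <x, y>; its feature map is the identity."""
    return sum(a * b for a, b in zip(x, y))

def gaussian_kernel(x, y, gamma=1.0):
    """Illustrative Gaussian kernel k(x, y) = exp(-gamma * |x - y|^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

if __name__ == "__main__":
    print(pullback_metric(linear_kernel, (0.0, 0.0), (3.0, 4.0)))
    print(pullback_metric(gaussian_kernel, (0.0,), (1.0,)))
```

For the Gaussian kernel, $k(x,x) = 1$, so $d_H(x_1,x_2) = \sqrt{2 - 2k(x_1,x_2)}$, which is bounded by $\sqrt{2}$ and small exactly when $x_1$ and $x_2$ are close in the Euclidean metric.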

We refer to [122] for the foundational facts and terminology regarding RKBSs. 

Lemma 7.8 (Fortet). A RKBS $\mathcal{B}$ is separable if and only if there exists a feature Banach
space $W$ and feature map $\Phi\colon X \to W$ for $\mathcal{B}$ such that for all $\epsilon > 0$ there is a countable
partition $B_j \subset X$, $j \in \mathbb{N}$, with $\bigcup_{j \in \mathbb{N}} B_j = X$ such that for all $j \in \mathbb{N}$ and all $x_1, x_2 \in B_j$,
we have

$$\|\Phi(x_1) - \Phi(x_2)\|_W < \epsilon. \tag{7.4}$$

From Lemma 7.8, it follows that if $X$ is Lindelöf (meaning that every open cover has
a countable subcover), then any RKBS of functions on $X$ which has a continuous feature
map $\Phi\colon X \to W$ is separable. Therefore, since Polish implies Suslin implies Lindelöf,
RKBSs of functions on Polish or Suslin spaces are separable when there is a continuous
feature map. Moreover, from the proof of Lemma 7.8, we easily conclude the RKBS
version of [101, Lem. 4.33] when combined with [ , Lem. 4.29]: a RKBS of functions
on a separable space $X$ is separable if it has a continuous feature map.

Finally, a very strong characterization of separability is available, due to a theorem of
Stone [10(3, Thm. 16, pg. 32], when $X$ is a separable absolutely Borel space, in particular,
when $X$ is Polish. Following Frolík [ ], a metrizable space $X$ is said to be absolutely
Borel if $X \subset Z$ is a Borel subset for all metrizable $Z$. Moreover, Frolík [56] introduces
bianalytic spaces as analytic spaces such that their complement in their Čech
compactification is also analytic and, in Frolík [56, Thm. 12], shows that a metrizable space is
separable absolutely Borel if and only if it is bianalytic. The following result regarding
the separability of RKHSs and RKBSs is of independent interest and easily gives us our
main result when $X$ is Polish.

Lemma 7.9. Let $X$ be bianalytic and let $\mathcal{K}$ be a RKHS with measurable feature map or
a RKBS with measurable primary feature map. Then $\mathcal{K}$ is separable.

Theorem 7.10. Let $X$ be Polish and let $\mathcal{K}$ be a RKHS with measurable feature map or
a RKBS with a dual pair of measurable feature maps. Then $\mathcal{K}$ is a Polish evaluation
measurable function space.

The space D(X) of differences of upper semicontinuous functions has been investigated
by Rosenthal [ ], who describes a Banach space structure for it. However,
regarding separability, Rosenthal [92] tells us that "off the top, I'd say — almost never. 
e.g. if X has a non-trivial convergent sequence" . 

The space DC(X) of differences of convex functions, see Tuy [109], is important
in non-convex optimization because, if the decomposition into the difference of convex
functions is known, then convex analysis can be used for both the algorithms and the
development and evaluation of optimality criteria. Moreover, the class DC(X) is
relatively rich. For example, Tuy [109, Prop. 3.2] states that the restriction $f_\Omega$ of any twice
continuously differentiable function $f\colon \mathbb{R}^n \to \mathbb{R}$ to a compact convex set $\Omega \subset \mathbb{R}^n$ is in
DC($\Omega$). At present, we are not aware of any topologies for either D(X) or DC(X) that
would make them Polish.

The space UC(X) of upper semicontinuous functions is not linear, but a cone.
Semicontinuous functions are important in many areas of mathematics, in particular
optimization theory, and since the OUQ framework is optimization based, it appears natural
to consider it. In the following sections we will describe a topology for the space UC(X)
which makes it into a Polish evaluation measurable function space. Note that, unlike the
above examples, here point evaluation will not be continuous, but semicontinuous, and
therefore still measurable. We conjecture that this topology can be used to topologize
D(X) and DC(X) in such a way as to make them Polish evaluation measurable function
spaces, but we leave that for the future.

7.3 Polish Topologies for Upper Semicontinuous Functions 

Following Beer [ ], we topologize the space of upper semicontinuous functions UC(X)
through the identification of an upper semicontinuous function with its hypograph, which
is a closed set, and a topology on the space of closed sets. To that end, recall that an
upper semicontinuous function $f\colon X \to \mathbb{R}$ is such that its hypograph

$$\mathrm{hypo}(f) := \{(x,a) \in X \times \mathbb{R} \mid f(x) \geq a\}$$

is a closed set. An equivalent definition is that the excursion set

$$\{x \in X \mid f(x) \geq a\}$$

is closed for all $a \in \mathbb{R}$. It follows then that upper semicontinuous functions are
measurable. The mapping $f \mapsto \mathrm{hypo}(f)$ can be used to pull back structures from the set of
closed convex sets to sets of upper semicontinuous functions. Let us denote the space
of upper semicontinuous functions on $X$ by UC(X) and the space of closed subsets of
$X \times \mathbb{R}$ by CL($X \times \mathbb{R}$). Then we have a map

$$\mathrm{hypo}\colon \mathrm{UC}(X) \to \mathrm{CL}(X \times \mathbb{R})$$

and so if we topologize, metrize, or measurablize CL($X \times \mathbb{R}$) we can pull back such
structures to UC(X) through the map hypo.

7.4 Hyperspace Topologies and Measurability 

Since the map $\mathrm{hypo}\colon \mathrm{UC}(X) \to \mathrm{CL}(X \times \mathbb{R})$ gives a method for transferring structure from
spaces of closed subsets, we now describe topologies and $\sigma$-algebras for the hyperspace
of closed subsets. This subject has been heavily researched and we will only define
and use what we need. Evidently, the classic reference is Matheron [80]. Lest one
should think that these ideas come from the ivory tower, note that Matheron developed
these ideas for mining; see Agterberg [ ] for an illuminating biography. For each of
the increasing categories (compact, locally compact, Polish) for the base space $X$, we will
establish topologies on the space UC(X) of upper semicontinuous functions making
it a Polish evaluation measurable function space. Moreover, we will do so in one stroke.
For a topological space $X$ let $\mathcal{G}$ denote the collection of non-empty open sets, $\mathcal{F}$ the
collection of non-empty closed sets, and $\mathcal{K}$ the collection of non-empty compact sets.

The most famous hyperspace topology is the Vietoris topology [82] but we will not
use it, except to note that, by [4, Thm. 3.91], the Vietoris topology $\tau_V$ and the
Hausdorff metric topology $\tau_H$, to be defined below, coincide when relativized to $\mathcal{K}$. Let
us proceed to the Hausdorff metric topology. When $(X, d)$ is a semimetric space we can
define the distance from a point $x \in X$ to a subset $A \subset X$ by

$$d(x,A) := \inf_{x' \in A} d(x,x')$$

and the Hausdorff distance between arbitrary subsets by

$$h_d(A,B) := \max\Big\{ \sup_{x \in A} d(x,B),\ \sup_{x' \in B} d(x',A) \Big\}, \quad \text{for } A, B \subset X. \tag{7.5}$$

By [4, Lem. 3.74] we have a characterization which, among other things, allows a direct
comparison of the Hausdorff metric topology with the Wijsman topology which we will
describe soon:

$$h_d(A,B) = \sup_{x \in X} |d(x,A) - d(x,B)|, \quad \text{for } A, B \subset X.$$
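
The agreement of the two expressions for the Hausdorff distance can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the ambient space is a hypothetical finite grid in the plane with the Euclidean metric, and the function names are chosen here, not taken from the paper.

```python
import math

def dist_to_set(x, A):
    """d(x, A) = min over x' in A of d(x, x'), for a finite non-empty set A."""
    return min(math.dist(x, a) for a in A)

def hausdorff_maxsup(A, B):
    """Hausdorff distance via the max-of-suprema formula (7.5)."""
    return max(max(dist_to_set(a, B) for a in A),
               max(dist_to_set(b, A) for b in B))

def hausdorff_distfun(A, B, X):
    """sup over x in X of |d(x, A) - d(x, B)|, where the finite space X contains A and B."""
    return max(abs(dist_to_set(x, A) - dist_to_set(x, B)) for x in X)

# Hypothetical finite metric space: a 5 x 5 integer grid with the Euclidean metric.
X = [(i, j) for i in range(5) for j in range(5)]
A = [(0, 0), (1, 0)]
B = [(0, 0), (4, 3)]

if __name__ == "__main__":
    print(hausdorff_maxsup(A, B), hausdorff_distfun(A, B, X))
```

The second formula is the one that makes the comparison with the Wijsman topology transparent: it is a uniform distance between the functions $x \mapsto d(x,A)$ and $x \mapsto d(x,B)$, whereas the Wijsman topology only asks for pointwise control of these functions.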

When $d$ is a metric, it follows that $h_d$ is an extended-valued metric on the set of closed
subsets $\mathcal{F}$ and as such by [ , Lem. 3.77] defines a first countable Hausdorff metric
topology $\tau_H$ on $\mathcal{F}$. Moreover, by [4, Thm. 3.91], when relativized to $\mathcal{K}$ the Hausdorff metric
topology is topological in that it is the same for all admissible metrics for $X$ metrizable.
For our purposes, we need the fact [4, Thm. 3.85] that for a metric space $(X, d)$,

$$(\mathcal{F}, \tau_H) \text{ is Polish if and only if } (X, d) \text{ is compact} \tag{7.6}$$

from which we conclude that

$$X \text{ is compact metrizable} \implies (\mathcal{F}, \tau_H) \text{ is Polish} \tag{7.7}$$

and the same for all admissible metrics $d$. Consequently, when a metrizable $X$ is not
compact, then the Hausdorff metric topology on $\mathcal{F}$ defined by any admissible metric is
not Polish.

When $X$ is metrizable but not compact, let us consider instead the Fell topology
[51]. It is defined as the topology $\tau_F$ generated by the base consisting of

$$\{F \in \mathcal{F} \mid F \cap G \neq \emptyset\}, \quad G \in \mathcal{G} \tag{7.8}$$

$$\{F \in \mathcal{F} \mid F \cap K = \emptyset\}, \quad K \in \mathcal{K}. \tag{7.9}$$

For our purposes, we need the fact [4, Cor. 3.95] that, for a locally compact Polish space
$X$, we have

$$X \text{ locally compact and Polish} \implies (\mathcal{F}, \tau_F) \text{ is Polish}. \tag{7.10}$$

For the other direction, Molchanov [83, Thm. B.2.iii] asserts that when $X$ is Hausdorff

$$(\mathcal{F}, \tau_F) \text{ is Polish} \implies X \text{ is locally compact and second countable}. \tag{7.11}$$

Moreover, [ , Thm. 3.93] implies that

$$X \text{ is compact and metrizable} \implies \tau_H = \tau_F \tag{7.12}$$

for any admissible metric $d$. Therefore, the Fell topology can be considered a Polish
generalization of the Hausdorff metric topology from compact metrizable $X$ to locally
compact Polish $X$. However, for Hausdorff spaces, (7.11) asserts that this Polish
generalization does not go past locally compact second countable spaces.

To get past this to infinite dimensions, let us consider the Wijsman [120] topology
$\tau_W$ on a metric space $(X, d)$, defined as the initial topology generated by the functions

$$A \mapsto d(x,A), \quad \text{for } A \in \mathcal{F} \tag{7.13}$$

as $x$ varies over $X$. It is the weakest topology on $\mathcal{F}$ such that the function (7.13) on
$\mathcal{F}$ is continuous for all $x \in X$. Wijsman [120] demonstrated that this topology makes
the Fenchel transform continuous on locally compact spaces. Moreover, even though his
results were stated for convex functions and sets, it is clear that many of his results carry
over to non-convex sets and functions. A complete investigation of this topology, and its
history, can be found under the name now used, $\Gamma$-convergence, in Dal Maso [39], where,
in particular, $\Gamma$-convergence is shown to be a powerful tool in homogenization.
However, what we use here is the following result of Beer [ ]:

$$X \text{ is Polish} \implies (\mathcal{F}, \tau_W) \text{ is Polish}. \tag{7.14}$$
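
To see why the Wijsman topology can remain well behaved on a non-compact space where the Hausdorff metric topology does not, consider the following small numerical illustration. It is an example chosen for this note, not one from the paper: on the real line, the closed sets $A_n = \{0, n\}$ satisfy $d(x, A_n) \to d(x, \{0\})$ for every fixed $x$, so they converge to $\{0\}$ in the Wijsman sense, while their Hausdorff distance to $\{0\}$ equals $n$.

```python
def dist_to_set(x, A):
    """d(x, A) on the real line for a finite non-empty set A."""
    return min(abs(x - a) for a in A)

def hausdorff(A, B):
    """Hausdorff distance between finite non-empty subsets of the real line."""
    return max(max(dist_to_set(a, B) for a in A),
               max(dist_to_set(b, A) for b in B))

def A_n(n):
    """The closed set {0, n}."""
    return [0.0, float(n)]

LIMIT = [0.0]
TEST_POINTS = [-3.0, 0.5, 2.0, 7.25]  # arbitrary fixed points x

def wijsman_gaps(n):
    """|d(x, A_n) - d(x, {0})| at each fixed test point x."""
    return [abs(dist_to_set(x, A_n(n)) - dist_to_set(x, LIMIT)) for x in TEST_POINTS]

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, max(wijsman_gaps(n)), hausdorff(A_n(n), LIMIT))
```

The pointwise gaps vanish as soon as $n$ exceeds twice the largest test point, whereas the Hausdorff distance grows without bound.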

To show that the Wijsman topology is a Polish generalization of the Fell topology,
observe that Beer [13, Thm. 2.3] shows that when $X$ is metric, we have $\tau_W = \tau_F$ if and
only if $X$ has nice closed balls, i.e. the only non-compact closed ball is the whole space.
Moreover, Beer [11, Thm. 2] shows that when $X$ is metrizable, it is locally compact if
and only if it admits a metric with nice closed balls. Consequently, when $X$ is locally
compact and metrizable we have $\tau_W = \tau_F$ for any admissible metric. On the other hand,
Beer [ , Pg. 92] shows that if a metric space has nice closed balls, then it is complete.
Consequently, since a metric space with nice closed balls is clearly locally compact, we
have for metric $X$ that $\tau_W = \tau_F$ implies that $X$ is locally compact Polish and therefore
we conclude

$$X \text{ locally compact Polish} \iff \tau_W = \tau_F \tag{7.15}$$

for all admissible metrics.

For a thorough investigation into the interrelationships between the Vietoris,
Hausdorff, Fell, Wijsman and other hyperspace topologies for various topological spaces $X$,
see Beer, Lechicki, Levi, and Naimpally [17]. In particular, note that they show [17,
Thm. 3.1] that, for a metrizable space $X$, the Vietoris topology is the supremum of
the Wijsman topologies over all admissible metrics, that is, the weakest topology such
that $A \mapsto d(x,A)$, $A \in \mathcal{F}$, is continuous for all $x \in X$ and all admissible metrics $d$. On
the other hand, when $X$ is locally compact, Beer [14] shows that the infimum of the Wijsman
topologies over the admissible metrics is the Fell topology. More generally, Costantini,
Levi, and Pelant [ , Thm. 3.1] show that the infimum of the Wijsman topologies is the
topology of upper Kuratowski convergence.






7.4.1 The Effros σ-algebra

Now let us discuss measurability on $\mathcal{F}$. The Effros [ ] $\sigma$-algebra $\sigma_E(\mathcal{F})$ on $\mathcal{F}$ is defined
to be the $\sigma$-algebra generated by sets of the form

$$\{F \in \mathcal{F} \mid F \cap G \neq \emptyset\} \quad \text{for } G \in \mathcal{G}. \tag{7.16}$$

Beer [15, Pg. 1125] credits Hess [64, 65] with the result that

$$X \text{ is separable metric} \implies \sigma_E(\mathcal{F}) = \sigma(\tau_W), \tag{7.17}$$

i.e. the Effros $\sigma$-algebra is generated by the Wijsman topology. We summarize the above
discussion in the following proposition:

Proposition 7.11. Consider three admissible cases:

$$X \text{ is } \begin{cases} \text{compact metrizable with Hausdorff metric topology on } \mathcal{F}, \\ \text{locally compact Polish with Fell topology on } \mathcal{F}, \\ \text{Polish with Wijsman topology on } \mathcal{F}. \end{cases}$$

In all three cases, the same hyperspace topology results when we use the Wijsman
topology. That is, the above is the same as

$$X \text{ is } \begin{cases} \text{compact metrizable with Wijsman topology on } \mathcal{F}, \\ \text{locally compact Polish with Wijsman topology on } \mathcal{F}, \\ \text{Polish with Wijsman topology on } \mathcal{F}. \end{cases}$$

Moreover, in all three cases the hyperspace topologies are Polish and the Borel $\sigma$-algebra
corresponding with these topologies is the Effros $\sigma$-algebra.

7.5 Main Theorem for Semicontinuous Functions 

We are now ready to state our main theorem. 

Theorem 7.12. Suppose that $X$ is Polish, and consider the hyperspace CL($X \times \mathbb{R}$) of
closed subsets with the Wijsman topology $\tau_W$. Let UC(X) denote the upper
semicontinuous functions and $\mathrm{hypo}\colon \mathrm{UC}(X) \to \mathrm{CL}(X \times \mathbb{R})$ be the hypograph mapping. Consider the
pullback topology $\tau_W(\mathrm{UC}) := \mathrm{hypo}^{-1}(\tau_W)$ and its resulting Borel $\sigma$-algebra $\mathcal{B}(\tau_W(\mathrm{UC}))$
of subsets of UC(X). Then

$$\big(\mathrm{UC}(X), \mathcal{B}(\tau_W(\mathrm{UC}))\big)$$

is a Polish evaluation measurable function space.
Remark 7.13. For an upper semicontinuous function $f$, define ${}_\rho f\colon X \to \mathbb{R}$ by

$$({}_\rho f)(x) := \sup_{y \in B_\rho(x)} f(y).$$

Wijsman's Theorem [120, Thm. 6.1] states that when $X$ has nice closed balls, then
$\mathrm{hypo}(f_n) \to \mathrm{hypo}(f)$ if and only if

$$f = \lim_{\rho \to 0} \limsup_{n \to \infty} {}_\rho f_n \quad \text{and} \quad f = \lim_{\rho \to 0} \liminf_{n \to \infty} {}_\rho f_n.$$

Although it appears that [120, Thm. 6.1] may be correct for Polish spaces that are not
locally compact, the proof utilizes [120, Thm. 3.1], which requires that the space have
nice closed balls. However, the existence of this extension to non-locally-compact spaces
does not affect the assertion of Theorem 7.12.

Corollary 7.14. In each of the three admissible cases for $X$ and the specified
topology $\tau$ on UC(X) defined in Proposition 7.11, it follows that $(\mathrm{UC}(X), \mathcal{B}(\tau))$ is a Polish
evaluation measurable function space.

8 Proofs 

8.1 Proof of Theorem 4.6 

For $q \in \mathbb{R}^n$, define

$$\Pi(q) := \Psi^{-1} q = \{\pi \in \mathcal{M}(\mathcal{A}) \mid \mathbb{E}_\pi[\Psi] = q\}$$

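and let $\Pi(q, n) := \Pi(q) \cap \Delta(n) \subset \Pi(q)$ be the subset consisting of $(n + 1)$-fold convex
combinations of Dirac masses. Using a layercake approach, we use the fact that

$$\Pi(Z) = \bigcup_{q \in Z} \Pi(q) \quad \text{and} \quad \Pi(Z,n) = \bigcup_{q \in Z} \Pi(q,n),$$

while applying Theorem 4.4 with equality constraints $\Pi(q)$, $q \in \mathbb{R}^n$, and the fact that
the supremum over a union is a supremum of suprema to obtain a reduction as follows:

$$\begin{aligned}
\mathcal{U}(\Pi(Z)) &= \mathcal{U}\Big(\bigcup_{q \in Z} \Pi(q)\Big) \\
&= \sup_{q \in Z} \mathcal{U}(\Pi(q)) \\
&= \sup_{q \in Z} \mathcal{U}(\Pi(q,n)) \\
&= \mathcal{U}\Big(\bigcup_{q \in Z} \Pi(q,n)\Big) \\
&= \mathcal{U}(\Pi(Z,n)).
\end{aligned}$$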
8.2 Proof of Lemma 4.10

Since $T \subset \mathcal{Q}$ is a subset of a separable metrizable space, [4, Cor. 3.5] implies that it
is itself separable and metrizable. Consider the set-valued map with non-empty values
$\Psi^{-1}\colon T \twoheadrightarrow \mathcal{A}$ with graph $G$ defined by

$$G := \{ (q, (f,\mu)) \in T \times \mathcal{A} \mid \Psi(f,\mu) = q \}. \tag{8.1}$$

Let $d$ be a metric that generates the topology of $T$ and define $h\colon T \times \mathcal{A} \to \mathbb{R}$ by
$h(q,(f,\mu)) := d(\Psi(f,\mu), q)$. Then, since $d$ is continuous in each of its arguments, it
follows that $h$ is a Carathéodory function, as defined in Definition 9.2. Since $T$ is
separable and metrizable, Lemma 9.3 implies that $h$ is $\mathcal{B}(T) \otimes \mathcal{B}(\mathcal{A})$-measurable. Rewriting
Equation (8.1) as

$$G = \{ (q, (f,\mu)) \in T \times \mathcal{A} \mid h(q,(f,\mu)) = 0 \}$$

yields that $G$ belongs to $\mathcal{B}(T) \otimes \mathcal{B}(\mathcal{A})$. Lemma 9.1 (through the identification $S = \mathcal{A}$,
$s = (f,\mu)$, $\varphi(t,s) = \Phi(f,\mu)$) implies that the function $\mathcal{U} \circ \Psi^{-1}\colon T \to \mathbb{R}$ defined for
$q \in T$ by $q \mapsto \sup_{(f,\mu) \in \Psi^{-1}(q)} \Phi(f,\mu)$ is $\hat{\mathcal{B}}(T)$-measurable, thereby establishing the first
assertion. The second assertion then follows from the second part of Lemma 9.1.

8.3 Proof of Theorem 4.11 

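For the first assertion, consider $Q \in \mathfrak{Q}$. Then, by the second assertion of Lemma 4.10,
there exists a $\hat{\mathcal{B}}(\mathrm{supp}(Q))$-measurable section $\psi$ of $\Psi$, i.e. a $\hat{\mathcal{B}}(\mathrm{supp}(Q))$-measurable
function $\psi\colon \mathrm{supp}(Q) \to \mathcal{A}$ such that $\Psi(\psi(q)) = q$ for all $q \in \mathrm{supp}(Q)$. Let $Q$ also denote its
restriction to its support and $\hat{Q}$ its completion. Let $\pi := \psi\hat{Q} \in \mathcal{M}(\mathcal{A})$, so that, for all
$A \in \mathcal{B}(\mathrm{supp}(Q))$,

$$\begin{aligned}
(\Psi\pi)(A) &= (\Psi\psi\hat{Q})(A) \\
&= ((\Psi \circ \psi)\hat{Q})(A) \\
&= \hat{Q}(A) \\
&= Q(A).
\end{aligned}$$

Hence, $\Psi\pi = Q$, establishing the first assertion.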

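For the main assertion, observe that, for all $(f,\mu) \in \mathcal{A}$,

$$(\mathcal{U} \circ \Psi^{-1} \circ \Psi)(f,\mu) = \sup_{(f',\mu') \in \Psi^{-1}(\Psi(f,\mu))} \Phi(f',\mu') \geq \Phi(f,\mu). \tag{8.2}$$

Consequently, for $Q \in \mathfrak{Q}$, the first assertion shows that there is a $\pi$ such that $\Psi\pi = Q$,
so that a change of variables (Proposition 9.7) and the monotonicity properties
(Proposition 9.4) of these integrals, together with the inequality (8.2), imply that

$$\mathbb{E}_Q[\mathcal{U} \circ \Psi^{-1}] \geq \mathbb{E}_\pi[\Phi],$$

and therefore

$$\mathbb{E}_Q[\mathcal{U} \circ \Psi^{-1}] \geq \sup_{\pi \in \Psi^{-1}Q} \mathbb{E}_\pi[\Phi].$$

Consequently,

$$\sup_{Q \in \mathfrak{Q}} \mathbb{E}_Q[\mathcal{U} \circ \Psi^{-1}] \geq \sup_{\pi \in \Psi^{-1}\mathfrak{Q}} \mathbb{E}_\pi[\Phi] = \mathcal{U}(\Psi^{-1}\mathfrak{Q})$$

and, in particular,

$$\sup_{Q \in \mathfrak{Q}} \mathbb{E}_Q[\mathcal{U} \circ \Psi^{-1}] \geq \mathcal{U}(\Psi^{-1}\mathfrak{Q}). \tag{8.3}$$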

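To obtain the reverse inequality, for $\delta > 0$ consider $Q \in \mathfrak{Q}$ and apply Lemma 4.10
to conclude that there exists a $\delta$-optimal $\hat{\mathcal{B}}(\mathrm{supp}(Q))$-measurable section of $\Psi$; that
is, a $\hat{\mathcal{B}}(\mathrm{supp}(Q))$-measurable function $\psi\colon \mathrm{supp}(Q) \to \mathcal{A}$ such that $\Psi(\psi(q)) = q$ for
all $q \in \mathrm{supp}(Q)$ and $(\Phi \circ \psi)(q) > (\mathcal{U} \circ \Psi^{-1})(q) - \delta$ for all $q \in \mathrm{supp}(Q)$. Now let
$\pi := \psi\hat{Q} \in \mathcal{M}(\mathcal{A})$, and observe from the proof of the first assertion that $\Psi\pi = Q$, and
therefore $\pi \in \Psi^{-1}Q$. Therefore, by a change of variables,

$$\mathbb{E}_\pi[\Phi] = \mathbb{E}_{\psi\hat{Q}}[\Phi] = \mathbb{E}_{\hat{Q}}[\Phi \circ \psi] \geq \mathbb{E}_{\hat{Q}}[\mathcal{U} \circ \Psi^{-1}] - \delta.$$

Since, by definition, $\mathbb{E}_Q[\mathcal{U} \circ \Psi^{-1}] := \mathbb{E}_{\hat{Q}}[\mathcal{U} \circ \Psi^{-1}]$, it follows that

$$\mathcal{U}(\Psi^{-1}\mathfrak{Q}) = \sup_{\pi \in \Psi^{-1}\mathfrak{Q}} \mathbb{E}_\pi[\Phi] \geq \sup_{Q \in \mathfrak{Q}} \mathbb{E}_Q[\mathcal{U} \circ \Psi^{-1}] - \delta.$$

Since $\delta > 0$ was arbitrary, it follows that

$$\mathcal{U}(\Psi^{-1}\mathfrak{Q}) \geq \sup_{Q \in \mathfrak{Q}} \mathbb{E}_Q[\mathcal{U} \circ \Psi^{-1}].$$

Recalling the reverse inequality (8.3), we obtain the main assertion.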
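The assertion of measure affinity follows from Lemma 9.9.
For the assertion (5.11), define

$$\Pi := \{\pi \in \mathcal{M}(\mathcal{A}) \mid \mathbb{E}_\pi[\psi_i] = 0 \text{ for } i = 1, \ldots, n\}.$$

Let $\epsilon > 0$. Assume that $\sup_{\pi_+ \in \Pi_+} \mathbb{E}_{\pi_+}[\Phi] > \lambda$ and that $\pi_+ \in \Pi_+$ is such that $\mathbb{E}_{\pi_+}[\Phi] > \lambda$.
Observe that $\pi := \pi_+/\pi_+(\mathcal{A})$ is an element of $\Pi$ that satisfies $\mathbb{E}_\pi[\Phi - \lambda\psi_0] > 0$.
Define $\Pi_n$ as in (4.9) and apply [86, Thm. 4.1] to $\sup_{\pi \in \Pi} \mathbb{E}_\pi[\Phi - \lambda\psi_0]$ to conclude that
there exists $\pi^* \in \Pi_n$ such that $\mathbb{E}_{\pi^*}[\Phi - \lambda\psi_0] > 0$. Since $\Phi - \lambda\psi_0 = (\varphi - \lambda)\psi_0$ and $\psi_0$
is positive, it also follows that $\mathbb{E}_{\pi^*}[\psi_0] > 0$. Writing $\pi_+^* := \pi^*/\mathbb{E}_{\pi^*}[\psi_0]$ we obtain that
$\pi_+^* \in \Pi_{+,n}$ and $\mathbb{E}_{\pi_+^*}[\Phi] > \lambda$, which concludes the proof of (5.11).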






8.4 Proof of Lemma 5.1 

Consider the set

$$\mathcal{Y}' := \bigcup \{O_y \mid O_y \subset \mathcal{Y} \text{ is open and } \nu(O_y) = 0\}.$$

First let us show that $E = \mathcal{Y}'$. To see this, first observe that trivially we have $E \subset \mathcal{Y}'$.
Now suppose that $y \in \mathcal{Y}'$. Then there exists a $y' \in \mathcal{Y}$ and an open $O_{y'} \ni y'$ such that
$y \in O_{y'}$ and $\nu(O_{y'}) = 0$. Therefore, $y \in E$ and hence $E = \mathcal{Y}'$.

Now, since $\mathcal{Y}'$ is a union of open sets, it is open and therefore measurable. Moreover,
since $\mathcal{Y}$ is strongly Lindelöf, it follows that $\mathcal{Y}'$ is Lindelöf and that the open cover of $\mathcal{Y}'$
by $\nu$-null open sets used in the definition of $\mathcal{Y}'$ has a countable subcover, so that

$$\mathcal{Y}' = \bigcup_{i \in \mathbb{N}} O_{y_i},$$

where each $O_{y_i}$ is open and has $\nu(O_{y_i}) = 0$. It follows that

$$\nu(E) = \nu(\mathcal{Y}') \leq \sum_{i \in \mathbb{N}} \nu(O_{y_i}) = 0$$

and the proof is finished.

8.5 Proof of Theorem 5.7 

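The first assertion, (5.10), follows by layering the set of positive measures of finite total
mass as $\bigcup_{r \in \mathbb{R}_+} \{r\mathcal{M}(\mathcal{A})\}$, using the fact that the supremum over a union is a supremum
of suprema, and applying the reduction theorem [86, Thm. 4.1] in $r\mathcal{M}(\mathcal{A})$ separately.
For the second assertion, (5.11), define

$$\Pi := \{\pi \in \mathcal{M}(\mathcal{A}) \mid \mathbb{E}_\pi[\psi_i] = 0 \text{ for } i = 1, \ldots, n\}.$$

Let $\epsilon > 0$. Assume that $\sup_{\pi_+ \in \Pi_+} \mathbb{E}_{\pi_+}[\Phi] > \lambda$ and that $\pi_+ \in \Pi_+$ is such that $\mathbb{E}_{\pi_+}[\Phi] > \lambda$.
Observe that $\pi := \pi_+/\pi_+(\mathcal{A})$ is an element of $\Pi$ that satisfies $\mathbb{E}_\pi[\Phi - \lambda\psi_0] > 0$.
Defining $\Pi_n$ as in (4.9) and applying [86, Thm. 4.1] to $\sup_{\pi \in \Pi} \mathbb{E}_\pi[\Phi - \lambda\psi_0] > 0$ we deduce
that there exists $\pi^* \in \Pi_n$ such that $\mathbb{E}_{\pi^*}[\Phi - \lambda\psi_0] > 0$. Since $\Phi - \lambda\psi_0 = (\varphi - \lambda)\psi_0$ and
$\psi_0$ is positive, it also follows that $\mathbb{E}_{\pi^*}[\psi_0] > 0$. Let $\pi_+^* := \pi^*/\mathbb{E}_{\pi^*}[\psi_0]$ to obtain that
$\pi_+^* \in \Pi_{+,n}$ and $\mathbb{E}_{\pi_+^*}[\Phi] > \lambda$, which concludes the proof of (5.11).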

8.6 Proof of Theorem 5.8 

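Observing that

$$\mathcal{U}(\Pi(q) \odot_B \mathfrak{D}) = \sup_{\mathbb{D} \in \mathfrak{D}} \mathcal{U}(\Pi(q) \odot_B \mathbb{D})$$

we may fix $\mathbb{D} \in \mathfrak{D}$. First, we prove that

$$\mathcal{U}(\Pi(q) \odot_B \mathbb{D}) = \sup_{\pi_+ \in \Pi_+(q)} \mathbb{E}_{(f,\mu) \sim \pi_+}\big[\Phi(f,\mu)\,\mathbb{D}(f,\mu)[B]\big], \tag{8.4}$$

where $\Pi_+(q)$ is the set of positive finite measures $\pi_+$ on $\mathcal{A}$ such that $\mathbb{E}_{\pi_+}[\Psi(f,\mu) - q] = 0$
and $\mathbb{E}_{\pi_+}[\mathbb{D}(f,\mu)[B]] = 1$. To that end, first observe that

$$\mathcal{U}(\Pi(q) \odot_B \mathbb{D}) = \sup_{\pi \in \Pi(q)\,:\,\pi \cdot \mathbb{D}[B] > 0} \mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B]$$

and that, for any $\pi \in \Pi(q)$ such that $\pi \cdot \mathbb{D}[B] > 0$,

$$\mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B] = \frac{\mathbb{E}_{(f,\mu) \sim \pi}\big[\Phi(f,\mu)\,\mathbb{D}(f,\mu)[B]\big]}{\mathbb{E}_{(f,\mu) \sim \pi}\big[\mathbb{D}(f,\mu)[B]\big]}. \tag{8.5}$$

Now consider $\pi \in \Pi(q)$ such that $\pi \cdot \mathbb{D}[B] > 0$. Then $\pi_+ := \pi/\mathbb{E}_\pi[\mathbb{D}(f,\mu)[B]]$ is an
element of $\Pi_+(q)$ and (8.5) implies that

$$\mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B] = \mathbb{E}_{(f,\mu) \sim \pi_+}\big[\Phi(f,\mu)\,\mathbb{D}(f,\mu)[B]\big].$$

Conversely, if $\pi_+ \in \Pi_+(q)$, then $\pi := \pi_+/\pi_+[\mathcal{A}]$ is an element of $\Pi(q)$ such that
$\pi \cdot \mathbb{D}[B] > 0$ and

$$\mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B] = \mathbb{E}_{(f,\mu) \sim \pi_+}\big[\Phi(f,\mu)\,\mathbb{D}(f,\mu)[B]\big].$$

Since the above argument also shows that $\Pi(q) \odot_B \mathbb{D}$ is nonempty if and only if $\Pi_+(q)$
is nonempty, (8.4) follows. The right hand side of (8.4) is a linear program in $\pi_+$, so
Theorem 5.7 implies that the supremum in $\pi_+$ can be achieved by assuming $\pi_+$ to be
the weighted sum of at most $n + 1$ Dirac masses, i.e. by assuming that

$$\pi_+ = \sum_{i=0}^{n} \alpha_i \delta_{(f_i,\mu_i)}. \tag{8.6}$$

This finishes the proof of Theorem 5.8.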

8.7 Proof of Theorem 5.10 

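First let us show that, for $\lambda \in \mathbb{R}$, the statement that

$$\mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B] > \lambda, \quad \pi \odot \mathbb{D} \in \Psi^{-1}(\mathfrak{Q}) \odot_B \mathfrak{D} \tag{8.7}$$

is equivalent to the statement that

$$\mathbb{E}_{(f,\mu) \sim \pi}\big[(\Phi(f,\mu) - \lambda)\,\mathbb{D}(f,\mu)[B]\big] > 0. \tag{8.8}$$

To that end, assume (8.7) and observe that the definition (5.4) of $\Psi^{-1}(\mathfrak{Q}) \odot_B \mathfrak{D}$ implies
that $\pi \cdot \mathbb{D}[B] > 0$, where, by (5.5),

$$\pi \cdot \mathbb{D}[B] := \mathbb{E}_{(f,\mu) \sim \pi}\big[\mathbb{D}(f,\mu)[B]\big]. \tag{8.9}$$

Consequently, by (5.2),

$$\mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid B] = \frac{\mathbb{E}_{(f,\mu) \sim \pi}\big[\Phi(f,\mu)\,\mathbb{D}(f,\mu)[B]\big]}{\mathbb{E}_{(f,\mu) \sim \pi}\big[\mathbb{D}(f,\mu)[B]\big]}$$

and the denominator is strictly positive. Therefore,

$$\begin{aligned}
\mathbb{E}_{(f,\mu) \sim \pi}\big[(\Phi(f,\mu) - \lambda)\,\mathbb{D}(f,\mu)[B]\big]
&= \mathbb{E}_{(f,\mu) \sim \pi}\big[\Phi(f,\mu)\,\mathbb{D}(f,\mu)[B]\big] - \lambda\,\mathbb{E}_{(f,\mu) \sim \pi}\big[\mathbb{D}(f,\mu)[B]\big] \\
&> 0,
\end{aligned}$$

and (8.8) follows. Conversely, assume (8.8) and observe that $\pi \cdot \mathbb{D}[B] > 0$. To see this,
observe that, if $\pi \cdot \mathbb{D}[B] = 0$, then (8.9) implies that $\mathbb{D}(f,\mu)[B] = 0$ $\pi$-a.s. and so

$$\mathbb{E}_{(f,\mu) \sim \pi}\big[(\Phi(f,\mu) - \lambda)\,\mathbb{D}(f,\mu)[B]\big] = 0,$$

which is a contradiction. Consequently, $\pi \cdot \mathbb{D}[B] > 0$ and dividing the assumption

$$\begin{aligned}
\mathbb{E}_{(f,\mu) \sim \pi}\big[(\Phi(f,\mu) - \lambda)\,\mathbb{D}(f,\mu)[B]\big]
&= \mathbb{E}_{(f,\mu) \sim \pi}\big[\Phi(f,\mu)\,\mathbb{D}(f,\mu)[B]\big] - \lambda\,\mathbb{E}_{(f,\mu) \sim \pi}\big[\mathbb{D}(f,\mu)[B]\big] \\
&> 0
\end{aligned}$$

by $\pi \cdot \mathbb{D}[B] := \mathbb{E}_{(f,\mu) \sim \pi}[\mathbb{D}(f,\mu)[B]]$ throughout yields (8.7) and the equivalence is
established.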
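
The equivalence between (8.7) and (8.8) can be checked on a toy example. The sketch below assumes a prior with finite support, so that expectations are finite sums; the atoms, the values of the quantity of interest and the data likelihoods are hypothetical numbers chosen for illustration, and the function names are not from the paper.

```python
def posterior_mean(weights, phi, lik):
    """E[Phi | B] = E_pi[Phi * D[B]] / E_pi[D[B]] for a finitely supported prior pi."""
    num = sum(w * p * l for w, p, l in zip(weights, phi, lik))
    den = sum(w * l for w, l in zip(weights, lik))
    if den <= 0:
        raise ValueError("the data B has zero probability under the prior")
    return num / den

def shifted_moment(weights, phi, lik, lam):
    """E_pi[(Phi - lam) * D[B]], the left-hand side of (8.8)."""
    return sum(w * (p - lam) * l for w, p, l in zip(weights, phi, lik))

# Hypothetical three-atom prior: prior weights, values of Phi, likelihoods D(f, mu)[B].
weights = [0.5, 0.3, 0.2]
phi = [0.0, 1.0, 4.0]
lik = [0.9, 0.1, 0.05]

if __name__ == "__main__":
    mean = posterior_mean(weights, phi, lik)
    for lam in (0.0, 0.1, 0.14, 0.15, 1.0):
        print(lam, mean > lam, shifted_moment(weights, phi, lik, lam) > 0)
```

The point of the reformulation is that (8.8) is linear in the prior, whereas the conditional expectation in (8.7) is a ratio.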

Using this equivalence, the main assertion then follows from a direct application of
Theorem 4.11. Finally, since $\Phi$ is semibounded, it follows that $(f,\mu) \mapsto \Phi(f,\mu)\,\mathbb{D}(f,\mu)[B]$
is semibounded and measurable, and the assertion of measure affinity follows from
Lemma 9.9.

8.8 Proof of Theorem 5.12 

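Let us first establish that the assumptions of the theorem are well defined. To that
end, note that Lemma 4.10 implies that $q \mapsto \inf_{(f,\mu) \in \Psi^{-1}(q)} \mathbb{D}(f,\mu)[B]$ is $\hat{\mathcal{B}}(\mathrm{supp}(Q))$-measurable
and hence (5.15) is well defined. Similarly (5.16) is well defined.

For the proof of the theorem, fix $\delta > 0$, let $Q \in \mathfrak{Q}$ and $\mathbb{D} \in \mathfrak{D}$ satisfy the assumptions,
and define $\lambda := \mathcal{U}(\mathcal{A}) - \delta$. Since $(\Phi(f,\mu) - \lambda)\,\mathbb{D}(f,\mu)[B]$ is bounded and measurable,
Lemma 4.10 implies that the function $q \mapsto \theta(q) := \sup_{(f,\mu) \in \Psi^{-1}(q)} (\Phi(f,\mu) - \lambda)\,\mathbb{D}(f,\mu)[B]$
is $\hat{\mathcal{B}}(\mathrm{supp}(Q))$-measurable. Moreover, (5.15) implies that the function $\theta$ is non-negative
with $Q$-probability one and (5.16) implies that $\theta$ is strictly positive on a subset of strictly
positive $Q$-measure. Hence,

$$\mathbb{E}_{q \sim Q}\Big[\sup_{(f,\mu) \in \Psi^{-1}(q)} (\Phi(f,\mu) - \lambda)\,\mathbb{D}(f,\mu)[B]\Big] = \mathbb{E}_Q[\theta] > 0,$$

and, therefore,

$$\sup_{Q \in \mathfrak{Q},\, \mathbb{D} \in \mathfrak{D}} \mathbb{E}_{q \sim Q}\Big[\sup_{(f,\mu) \in \Psi^{-1}(q)} (\Phi(f,\mu) - \lambda)\,\mathbb{D}(f,\mu)[B]\Big] > 0.$$

It then follows from Theorem 5.10 that $\mathcal{U}(\Psi^{-1}\mathfrak{Q} \odot_B \mathfrak{D}) > \lambda = \mathcal{U}(\mathcal{A}) - \delta$. Since $\delta > 0$
was arbitrary, it follows that $\mathcal{U}(\Psi^{-1}\mathfrak{Q} \odot_B \mathfrak{D}) \geq \mathcal{U}(\mathcal{A})$. Theorem 5.5 implies that

$$\mathcal{U}(\Psi^{-1}(\mathfrak{Q}) \odot_B \mathfrak{D}) \leq \mathcal{U}(\mathcal{A})$$

and the theorem follows.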



8.9 Proof of Theorem 5.20 

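Defining a function of interest $\tilde{\Phi} := (\Phi - m)^2$, we observe that the assumptions on $\Phi$ of
Theorem 5.20 imply that those of Theorem 5.12 are satisfied for $\tilde{\Phi}$. Therefore, applying
Theorem 5.12 to $\tilde{\Phi}$ yields

$$\sup_{\pi \odot \mathbb{D} \in \Psi^{-1}(\mathfrak{Q}) \odot_B \mathfrak{D}} \mathbb{E}_{\pi \odot \mathbb{D}}\big[(\Phi - m)^2 \mid B\big] = \sup_{(f,\mu) \in \mathcal{A}} (\Phi(f,\mu) - m)^2.$$

Since

$$\sup_{(f,\mu) \in \mathcal{A}} (\Phi(f,\mu) - m)^2 = \max\big\{ (\mathcal{U}(\mathcal{A}) - m)^2,\ (m - \mathcal{L}(\mathcal{A}))^2 \big\},$$

minimizing over $m$ completes the proof.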
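
The final minimization is elementary: when the lower and upper values $\mathcal{L}(\mathcal{A})$ and $\mathcal{U}(\mathcal{A})$ are finite, the larger of $(\mathcal{U}(\mathcal{A}) - m)^2$ and $(m - \mathcal{L}(\mathcal{A}))^2$ is smallest at the midpoint $m = (\mathcal{L}(\mathcal{A}) + \mathcal{U}(\mathcal{A}))/2$, where it equals $((\mathcal{U}(\mathcal{A}) - \mathcal{L}(\mathcal{A}))/2)^2$. The following sketch checks this numerically; the bounds and the grid are hypothetical values chosen for illustration.

```python
def worst_case_sq_dev(m, lower, upper):
    """max{(upper - m)^2, (m - lower)^2}: the largest squared deviation from m."""
    return max((upper - m) ** 2, (m - lower) ** 2)

# Hypothetical finite bounds standing in for the lower and upper values of Phi.
lower, upper = -1.0, 3.0
midpoint = 0.5 * (lower + upper)

# Brute-force minimization over a uniform grid of candidate values of m.
grid = [lower + (upper - lower) * i / 1000 for i in range(1001)]
best = min(grid, key=lambda m: worst_case_sq_dev(m, lower, upper))

if __name__ == "__main__":
    print(best, worst_case_sq_dev(best, lower, upper))
```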

8.10 Proof of Theorem 6.1 

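The proof follows from the proof of Theorem 5.12 as follows. Let $\delta > 0$, and let $\mathbb{D} \in \mathfrak{D}$
and a measurable section $\psi$ satisfy the assumptions. Define $\lambda := \mathfrak{Q}^\infty(\Phi \circ \psi) - \delta$, and
the universally measurable function $q \mapsto \theta(q) := \sup_{(f,\mu) \in \Psi^{-1}(q)} (\Phi(f,\mu) - \lambda)\,\mathbb{D}(f,\mu)[B]$.
Then assumption (6.1) implies that the function $\theta$ is non-negative. It follows from the
definition (6.14) of $\mathfrak{Q}^\infty(\Phi \circ \psi)$, and $\lambda < \mathfrak{Q}^\infty(\Phi \circ \psi)$, that there is a $Q \in \mathfrak{Q}$ such that
$\Phi \circ \psi > \lambda$ with nonzero $Q$-measure. Since

$$\begin{aligned}
\theta(q) &= \sup_{(f,\mu) \in \Psi^{-1}(q)} (\Phi(f,\mu) - \lambda)\,\mathbb{D}(f,\mu)[B] \\
&\geq (\Phi \circ \psi(q) - \lambda)\,\mathbb{D}(\psi(q))[B],
\end{aligned}$$

the positivity assumption (6.2) implies that $\theta$ is positive on a subset of positive
$Q$-measure. Hence,

$$\mathbb{E}_{q \sim Q}\Big[\sup_{(f,\mu) \in \Psi^{-1}(q)} (\Phi(f,\mu) - \lambda)\,\mathbb{D}(f,\mu)[B]\Big] = \mathbb{E}_Q[\theta] > 0$$

and, therefore,

$$\sup_{Q \in \mathfrak{Q},\, \mathbb{D} \in \mathfrak{D}} \mathbb{E}_{q \sim Q}\Big[\sup_{(f,\mu) \in \Psi^{-1}(q)} (\Phi(f,\mu) - \lambda)\,\mathbb{D}(f,\mu)[B]\Big] > 0.$$

It then follows from Theorem 5.10 that

$$\mathcal{U}(\Psi^{-1}\mathfrak{Q} \odot_B \mathfrak{D}) > \lambda = \mathfrak{Q}^\infty(\Phi \circ \psi) - \delta.$$

Since $\delta > 0$ was arbitrary, the assertion is proved.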

8.11 Proof of Theorem 6.10 

We appeal to the corollary. Theorem 6.1, to Theorem 5.12. To that end, let A be defined 
as in (6.8), and let Q := Aq, ^ := Pq, D := {O"}. 
Since D" = D[J o P„ is a pull-back, 

= iPa7T.nUB2], 

from which we conclude that {Pair ■ Do)[i?^] > if and only if (tt • D")[i?^] > 0, and so 
conclude 

where Pa acts on each component in the natural way. Moreover since $ = <l>o o -Po is 
also a pull-back, for vr G 11, we have 



E^0B"[^|5, 



^ E(^^,^,)^4c|>(/il,/i2)D"(m,/^2)[i??]] 

'^ E(^^,^,)_[D«(^i,/.2)[i?5l] 

JE(Mi,M2)~^ ['^O ° ^a(/"l. /"2) • Pq" o Pa{f^l,fl2)m] 
E(^^,^,)_[D«oP„(;xi,^2)[i??]] 



and so we conclude that 

K{U Qb^ O") = K{Ua Qb^ %)■ 
We will now need the following proposition 



(8.10) 



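Proposition 8.1. Consider $B \in \mathcal{B}(X)$. Then for $\mu \in \mathcal{M}(X)$ such that $\mu(B) < 1$, we
have

$$d_{TV}(\mu, \mu|_{B^c}) \leq \mu(B).$$

Proof. For $A \in \mathcal{B}(X)$, we have

$$\begin{aligned}
\mu(A) - \mu|_{B^c}(A) &= \mu(A) - \frac{\mu(A \cap B^c)}{\mu(B^c)} \\
&= \mu(A \cap B) + \mu(A \cap B^c) - \frac{\mu(A \cap B^c)}{\mu(B^c)}
\end{aligned}$$

and therefore

$$\mu(A) - \mu|_{B^c}(A) \leq \mu(A \cap B) \leq \mu(B)$$

and

$$\begin{aligned}
\mu(A) - \mu|_{B^c}(A) &\geq -\frac{\mu(B)}{\mu(B^c)}\,\mu(A \cap B^c) \\
&\geq -\frac{\mu(B)}{\mu(B^c)}\,\mu(B^c) \\
&= -\mu(B),
\end{aligned}$$

thus establishing the assertion. □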
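
The bound of Proposition 8.1 can be verified by brute force on a small finite space, where the supremum over measurable sets is a maximum over all subsets. The sketch below is an illustration only; the four-point measure and the function names are hypothetical choices, not taken from the paper.

```python
from itertools import chain, combinations

def condition_on_complement(mu, B):
    """The conditional measure of mu on the complement of B: A -> mu(A minus B) / mu(complement of B)."""
    mass_Bc = sum(p for x, p in mu.items() if x not in B)
    return {x: (0.0 if x in B else p / mass_Bc) for x, p in mu.items()}

def total_variation(mu, nu):
    """d_TV(mu, nu) = max over all subsets A of |mu(A) - nu(A)| on a finite space."""
    points = list(mu)
    subsets = chain.from_iterable(
        combinations(points, r) for r in range(len(points) + 1))
    return max(abs(sum(mu[x] for x in A) - sum(nu[x] for x in A)) for A in subsets)

# Hypothetical probability measure on four points, and an event B with mu(B) < 1.
mu = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
B = {"a", "b"}
nu = condition_on_complement(mu, B)

if __name__ == "__main__":
    print(total_variation(mu, nu), sum(mu[x] for x in B))
```

In this example the bound is attained, since taking $A = B$ gives $\mu(B) - \mu|_{B^c}(B) = \mu(B)$.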

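For $B \in \mathcal{B}(X)$ and $\mu \in \mathcal{M}(X)$ such that $\mu(B) < 1$, the conditional measure $\mu|_{B^c} \in
\mathcal{M}(X)$ is defined by

$$\mu|_{B^c}(A) := \frac{\mu(A \cap B^c)}{\mu(B^c)}, \quad A \in \mathcal{B}(X).$$

Consider the total variation metric $d_{TV}$ on $\mathcal{M}(X)$ defined by

$$d_{TV}(\mu_1,\mu_2) := \sup_{A \in \mathcal{B}(X)} |\mu_1(A) - \mu_2(A)|.$$

It follows from Proposition 8.1 that $d_{TV}(\mu, \mu|_{B^c}) \leq \mu(B)$ and since $d_{\mathcal{M}} \leq d_{TV}$ (see
e.g. [68, Eq. 2.24]), we conclude that

$$d_{\mathcal{M}}(\mu, \mu|_{B^c}) \leq \mu(B). \tag{8.11}$$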

Let Bs := i?5(xi) denote the ball about the first sample of x"" = (xi, . . . , Xn)- Then, 
for fiQ £ Ao, it follows from the assumptions that 

dM{fJ'0,IJ'0\Bl) < f^oiBs) 
<V°°{5) 
< a 

and therefore 






Moreover, since 

a 

< fJ-olBilBs] 
= 0, 

we conclude that the condition (6.1) 

inf D"(/xo,/i'o)[i??]=0 

of Theorem 6.1 is satisfied for all /_io S ^o- 

Now consider the diagonal map $\Delta\colon \mathcal{M}(X) \to \mathcal{M}(X) \times \mathcal{M}(X)$ defined by

$$\Delta(\mu) := (\mu,\mu), \quad \mu \in \mathcal{M}(X).$$

Since 

^ o A(/i) = Po o A(;u) = /u, for an/iG7W(A'), 

it follows, if we define A on the first component of the product M{X) x Ai{X) and then 
restrict to Aq, that A is a section of ^ = Pq- It is clearly measurable, but also satisfies 

Pq, o A(^i) = /i, for all ^ G A^(^), 

that is, Pq o A is the identity map from the first component of M{X) x Ai{X) to the 
second. Then, for /Uq & Aq, the positivity of the model V implies that 

D"(A(^o))[i??] = O^ o P4A(^o))[i??] 

n 

= l{fxo[Bs{xi)] 

1=1 
>0 

so that the second condition (6.2) of Theorem 6.1 is satisfied for all ^o £ -^o- Theorem 
6.1 then asserts that 

U{^~^Uo Qb^ O") > ng°($ o A). 

Moreover, since 

$ o A = <l>o o Pq o A = <I>o, 

now as a function on the first component of M{X) x Ai(X), and 

^-^no = Po-^no = n, 

we conclude that 

ZY(n©BniD)")>ng°($o). 

The identity U{U Qb^ O") = U{Ua Qb^ ICq) of (8.10) then imphes the assertion. 
8.12 Proof of Theorem 7.1

By the discussion at the beginning of Section 7.2, $C(X)$ is a Polish evaluation measurable
function space. By Theorem 7.10, such RKHSs or RKBSs are Polish evaluation
measurable function spaces. Also, Theorem 7.12 proves that UC(X) is a Polish evaluation
measurable function space. The assertion then follows from Corollary 7.7.

8.13 Proof of Lemma 7.2 

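By [ , Thm. 13.11, pg. 84] (see also [4, Thm. 4.59]), there exists a Polish topology $\tau^*$
on $X$ such that

$$\mathcal{B}(\tau^*) = \mathcal{B}(\tau_X)$$

and

$$f\colon (X,\tau^*) \to (Y,\tau_Y)$$

is continuous. Then, by [4, Thm. 15.14],

$$f_*\colon \mathcal{M}(\mathcal{B}(\tau^*)) \to \mathcal{M}(\mathcal{B}(\tau_Y))$$

is continuous. Since $\mathcal{B}(\tau^*) = \mathcal{B}(\tau_X)$, it follows that $\mathcal{M}(\mathcal{B}(\tau^*)) = \mathcal{M}(\mathcal{B}(\tau_X))$, and the
result follows.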

8.14 Proof of Proposition 7.3 

The first assertion follows by observing that the product metric makes the product of
complete spaces complete. The assertion that $J$ is measurable follows from Theorem 7.5,
the fact that point evaluation on $B(X)$ is continuous and therefore measurable, and that
$\mathcal{B}(\mathcal{F}(X)) \times \mathcal{B}(\mathcal{M}(X)) \subset \mathcal{B}(\mathcal{F}(X) \times \mathcal{M}(X))$.

8.15 Proof of Theorem 7.5 

Let us begin by proving the "if" part. To prove measurability in the product structure,
we would like to show that $J$ is a Carathéodory function (see e.g. [4, Def. 4.50]), meaning
that fixing $f$ it is continuous and fixing $\mu$ it is measurable. In that case, since $X$ is
Polish, it follows that $\mathcal{M}(X)$ is Polish, and since $(\mathcal{F}, \sigma(\mathcal{F}))$ is measurable and $\mathcal{M}(\mathbb{R})$ is
metrizable, it follows from [4, Lem. 4.51] that $J$ is measurable.

To that end, observe that Lemma 7.2 implies that the map $f_*\colon \mathcal{M}(X) \to \mathcal{M}(\mathbb{R})$ is
continuous, so it remains to show that, for $\mu \in \mathcal{M}(X)$, the map $\mu^*\colon \mathcal{F} \to \mathcal{M}(\mathbb{R})$ defined
by

$$\mu^*(f) := f_*\mu$$

is measurable. Consider a Dirac mass $\mu = \delta_x$ for $x \in X$. Then,

$$\delta_x^*(f) = f_*\delta_x = \delta_{f(x)}.$$

In order to prove that $\delta_x^*$ is measurable, we metrize the weak-* topology on $\mathcal{M}(\mathbb{R})$, which
by [48, Prop. 11.3.3] can be accomplished with the bounded Lipschitz metric $b$ defined
by

$$b(\nu_1,\nu_2) := \sup_{\|h\|_{BL} \leq 1} \int_{\mathbb{R}} h \, d(\nu_1 - \nu_2), \quad \nu_1, \nu_2 \in \mathcal{M}(\mathbb{R}), \tag{8.12}$$

defined in terms of any metric $d$ on $\mathbb{R}$ that generates the same topology as the standard
metric. By [ , Thm. 2.8.2], there is a totally bounded metrization $e$ of $\mathbb{R}$ such that
the completion with respect to that metric is compact. Choose such a metric $e$ for
the definition of the metric $b$ in (8.12). Now the space $\mathrm{BL}(\mathbb{R}, e)$ of bounded Lipschitz
functions on $(\mathbb{R}, e)$ is isomorphic to the bounded Lipschitz functions on the completion
with respect to $e$. However, this completion space is compact, and the Arzelà-Ascoli
Theorem [7, Thm. A8.5] implies that the unit ball in the space of bounded Lipschitz
functions on the completion is compact. Consequently, it follows that the unit ball in
$\mathrm{BL}(\mathbb{R}, e)$ is also compact. With this metrization, [4, Lem. 4.30] implies that to prove
that $\delta_x^*$ is measurable it is sufficient to prove that

$$f \mapsto b(\delta_x^* f, \mu_2)$$

is measurable for all $\mu_2 \in \mathcal{M}(\mathbb{R})$.

Consequently, the compactness of the unit ball of $\mathrm{BL}(\mathbb{R}, e)$ and the continuity of
$\nu\colon \mathrm{BL}(\mathbb{R}, e) \to \mathbb{R}$ defined by $\nu(h) := \int_{\mathbb{R}} h \, d\nu$ for all $\nu \in \mathcal{M}(\mathbb{R})$ together imply that there
exists a countable subset $\{h_i \mid i \in \mathbb{N}\}$ of the unit ball of $\mathrm{BL}(\mathbb{R}, e)$ such that

$$\begin{aligned}
b(\delta_x^* f, \mu_2) &:= \sup_{i \in \mathbb{N}} \int_{\mathbb{R}} h_i \, d(\delta_x^* f - \mu_2) \\
&= \sup_{i \in \mathbb{N}} \int_{\mathbb{R}} h_i \, d(\delta_{f(x)} - \mu_2) \\
&= \sup_{i \in \mathbb{N}} \Big( h_i(f(x)) - \int_{\mathbb{R}} h_i \, d\mu_2 \Big).
\end{aligned}$$

Consequently, if each function

$$f \mapsto h_i(f(x)) \tag{8.13}$$

is measurable then it follows from [48, Thm. 4.2.2] (which states that the pointwise
limit of a convergent sequence of measurable functions is measurable) that $\delta_x^*$
is measurable. However, the assumption that $(\mathcal{F}, \sigma(\mathcal{F}))$ is evaluation measurable, the
measurability of $h_i$, and the fact that the composition of measurable functions is
measurable imply that the function (8.13) is measurable. Therefore, $\delta_x^*$ is measurable for
each $x \in X$. Hence, $\mu^*$ is measurable for all measures $\mu$ with finite support. Since, by [4,
Thm. 15.10], the measures with finite support are dense in $\mathcal{M}(X)$, the continuity as a
function of $\mu$ for fixed $f$ (assured by Lemma 7.2) implies that $\mu^*$, for arbitrary $\mu \in \mathcal{M}(X)$,
is a pointwise convergent limit of a sequence of measurable functions into a metric space,
and therefore by [48, Thm. 4.2.2] is measurable. Therefore, $J$ is a Carathéodory function
and the "if" part of the assertion is established.






For the "only if" part, observe that, if $J$ is measurable in the product structure, then,
for $x \in \mathcal{X}$ and the corresponding Dirac mass $\delta_x \in \mathcal{M}(\mathcal{X})$, it follows that
$J(\,\cdot\,, \delta_x) \colon \mathcal{F}(\mathcal{X}) \to \mathcal{M}(\mathbb{R})$ is measurable. By [4, Thm. 15.8], the map $r \mapsto \delta_r$ is an embedding of $\mathbb{R}$ into
$\mathcal{M}(\mathbb{R})$. Then, since $J(f, \delta_x) = \delta_{f(x)}$, composition with the inverse of this embedding
yields that $f \mapsto f(x)$ is measurable for all $x \in \mathcal{X}$. That is, $(\mathcal{F}(\mathcal{X}), \sigma(\mathcal{F}))$ is an evaluation
measurable function space.

8.16 Proof of Corollary 7.6 

By Theorem 7.5, $J$ is measurable, and, by Lemma 7.2, $h_* \colon \mathcal{M}(\mathbb{R}) \to \mathcal{M}(\mathbb{R})$ is continuous.
Therefore, since

$$J_h(f, \mu) = (h \circ f)_* \mu = h_* f_* \mu = h_* J(f, \mu),$$

it follows that $J_h = h_* \circ J$ and the assertion follows.
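As a sanity check on the middle equality, the composition rule for pushforwards can be verified directly on a Borel set. The sketch below assumes the usual definition of the pushforward, $(f_* \mu)(A) := \mu(f^{-1}(A))$, which is not restated in this section; the set $A$ is introduced here for illustration only.

```latex
% For every Borel set A \subseteq \mathbb{R}:
\begin{aligned}
\bigl((h \circ f)_* \mu\bigr)(A)
  &= \mu\bigl((h \circ f)^{-1}(A)\bigr) \\
  &= \mu\bigl(f^{-1}(h^{-1}(A))\bigr) \\
  &= (f_* \mu)\bigl(h^{-1}(A)\bigr) \\
  &= \bigl(h_* (f_* \mu)\bigr)(A).
\end{aligned}
```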

8.17 Proof of Corollary 7.7 

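Since $\mathcal{X}$ is Polish, it follows that $\mathcal{M}(\mathcal{X})$ is Polish, and, since $(\mathcal{F}(\mathcal{X}), \tau(\mathcal{F}))$ is Polish, it
follows from [48, Prop. 2.1.4] that both $(\mathcal{F}(\mathcal{X}), \tau(\mathcal{F}))$ and $\mathcal{M}(\mathcal{X})$ are second countable.
Therefore, by [48, Prop. 4.1.7], the product of the Borel structures is the Borel structure
of the product. That is,

$$\mathcal{B}(\tau(\mathcal{F})) \times \mathcal{B}(\mathcal{M}(\mathcal{X})) = \mathcal{B}(\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})).$$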

The assertion then follows from Theorem 7.5. 

8.18 Proof of Lemma 7.8 

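Let us begin with the "if" part. To that end, let us first show that condition (7.4) implies
that $\Phi(\mathcal{X})$ is separable. Indeed, for each $k \in \mathbb{N}$, take $\varepsilon = 1/k$, let $B_j^k$, $j \in \mathbb{N}$, denote
the corresponding partition and let $x_j^k \in B_j^k$ denote a selection. The set

$$\{\Phi(x_j^k) \mid k \in \mathbb{N},\ j \in \mathbb{N}\}$$

is countable, and it is easy to show that it is dense in $\Phi(\mathcal{X})$. Hence, $\Phi(\mathcal{X})$ is separable.
Consequently, $\operatorname{span}\{\Phi(\mathcal{X})\}$ is separable, and, therefore, its closure $W := \overline{\operatorname{span}}\{\Phi(\mathcal{X})\}$ is separable.
It then follows that $\mathcal{B} = \{(u, \Phi^*(\cdot))\}$ with norm $\|(u, \Phi^*(\cdot))\| = \|u\|_W$ is separable.

For the "only if" part, suppose that $\mathcal{B}$ is separable. Then the feature space $W := \mathcal{B}$ is
separable and, since $\mathcal{B}$ is metric, it follows for the corresponding feature map $\Phi \colon \mathcal{X} \to \mathcal{B}$
that $\Phi(\mathcal{X}) \subset \mathcal{B}$ is separable. Therefore, there exists a countable dense set $\{\Phi(x_j) \mid j \in \mathbb{N}\}$
of $\Phi(\mathcal{X})$. Therefore, if for each $\varepsilon > 0$ and for each $j \in \mathbb{N}$ we define

$$B_j := \{x \in \mathcal{X} \mid \|\Phi(x_j) - \Phi(x)\|_{\mathcal{B}} < \varepsilon/2\},$$

it follows that $\bigcup_{j \in \mathbb{N}} B_j = \mathcal{X}$ and $\|\Phi(x_1) - \Phi(x_2)\|_{\mathcal{B}} < \varepsilon$ for all $x_1, x_2 \in B_j$.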
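The $\varepsilon$ bound is the triangle inequality through the center $\Phi(x_j)$, and the countable cover can be turned into a partition by the standard disjointification. The sketch below assumes that each $B_j$ consists of the points whose feature lies within $\varepsilon/2$ of $\Phi(x_j)$; the sets $B_j'$ are notation introduced here and do not appear in the original argument.

```latex
% Triangle inequality for x_1, x_2 \in B_j:
\|\Phi(x_1) - \Phi(x_2)\|_{\mathcal{B}}
  \le \|\Phi(x_1) - \Phi(x_j)\|_{\mathcal{B}} + \|\Phi(x_j) - \Phi(x_2)\|_{\mathcal{B}}
  < \tfrac{\varepsilon}{2} + \tfrac{\varepsilon}{2} = \varepsilon .

% Disjointification of the countable cover (B_j)_{j \in \mathbb{N}}:
B_1' := B_1, \qquad
B_j' := B_j \setminus \bigcup_{i < j} B_i \quad (j \ge 2).
% The B_j' are pairwise disjoint, \bigcup_j B_j' = \bigcup_j B_j = \mathcal{X},
% and B_j' \subseteq B_j, so the \varepsilon bound persists on each piece.
```

Any empty $B_j'$ can simply be discarded.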
8.19 Proof of Lemma 7.9

As in the proof of Lemma 7.4, it is sufficient to prove that the range of the feature
map is separable. First recall that Frolík [Frolik:1963, Thm. 2] has shown that bianalyticity is
equivalent to being separable and absolutely Borel. Then, observe that Stone's Theorem
[106, Thm. 16, pg. 32] states that when $\mathcal{X}$ is a separable absolutely Borel space and
$\Phi \colon \mathcal{X} \to \mathcal{Y}$ is a measurable bijection between $\mathcal{X}$ and a metric space $\mathcal{Y}$, then the image
$\Phi(\mathcal{X}) = \mathcal{Y}$ is separable. However, bijectivity of $\Phi$ is in fact not necessary. To see
this, select a metric on $\mathcal{X}$ that generates its topology and extend $\Phi$ to the injective map
$\hat{\Phi} \colon \mathcal{X} \to \mathcal{X} \times \mathcal{Y}$ defined by $\hat{\Phi}(x) := (x, \Phi(x))$, where, in the product, $\hat{\Phi}$ is measurable.
Then, by restricting to the range, with its inherited metric, we obtain that the resulting
$\hat{\Phi}$ is bijective, and so the range of $\hat{\Phi}$ is separable by Stone's Theorem. Since the range of
$\Phi$ is the continuous image of the range of $\hat{\Phi}$ and separability is preserved by continuous
maps, the range of $\Phi$ is separable.

8.20 Proof of Theorem 7.10 

Theorem 12 of [57] shows that $\mathcal{X}$ is absolutely Borel when it is Polish, and therefore it is
also bianalytic. Consequently, Lemma 7.9 implies the separability of $\mathcal{K}$. Since Banach
and Hilbert spaces are complete metric spaces, it follows that $\mathcal{K}$ is Polish. Moreover, the
measurability of a feature map for a RKHS or the measurability of a dual feature map 
for the RKBS guarantees that the space consists of measurable functions. Since point 
evaluation is continuous it is measurable, which completes the proof. 

8.21 Proof of Theorem 7.12 



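Let us first show that $(\mathrm{UC}(\mathcal{X}), \mathcal{B}(\tau_W(\mathrm{UC})))$ is Polish. To that end, recall that Beer [15]
has shown that $(\mathrm{CL}(\mathcal{X} \times \mathbb{R}), \tau_W)$ is Polish. Therefore, if $\operatorname{hypo}(\mathrm{UC}(\mathcal{X})) \subset \mathrm{CL}(\mathcal{X} \times \mathbb{R})$
is closed, it follows that $\operatorname{hypo}(\mathrm{UC}(\mathcal{X}))$ is Polish. Since hypo is one-to-one, it follows
from [ ] that with the pullback topology $\tau_W(\mathrm{UC})$, hypo is an embedding, and therefore
$(\mathrm{UC}(\mathcal{X}), \mathcal{B}(\tau_W(\mathrm{UC})))$ is Polish. So let us show that $\operatorname{hypo}(\mathrm{UC}(\mathcal{X}))$ is closed. Let
$\operatorname{hypo}(f_n) \to A$ and suppose that $A$ is not a hypograph. Then there exist $x \in \mathcal{X}$ and
$r_1 < r_2 < r_3$ such that $(x, r_1) \in A$, $(x, r_2) \notin A$ and $(x, r_3) \in A$. Consequently, by [16,
Proposition, pg. 80] (which Beer credits to Del Prete and Lignola [ ]), it follows that
if we select small enough non-overlapping open cylinders $B_1, B_2, B_3$ with the same base
about $(x, r_1)$, $(x, r_2)$, $(x, r_3)$ respectively, then there exists $N \in \mathbb{N}$ such that, for all
$n > N$, $\operatorname{hypo}(f_n) \cap B_1 \ne \emptyset$, $\operatorname{hypo}(f_n) \cap B_2 = \emptyset$, and $\operatorname{hypo}(f_n) \cap B_3 \ne \emptyset$. Since the bases
of the cylinders are the same, this contradicts the fact that $\operatorname{hypo}(f_n)$ is a hypograph.
Therefore, $A$ is a hypograph and so $\operatorname{hypo}(\mathrm{UC}(\mathcal{X})) \subset \mathrm{CL}(\mathcal{X} \times \mathbb{R})$ is closed and we have
proved that $\operatorname{hypo}(\mathrm{UC}(\mathcal{X}))$ is Polish.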

Now let us show it is an evaluation measurable function space. It follows from the
proof of [ , Thm. 6.1] that if $\operatorname{hypo}(f_n) \to \operatorname{hypo}(f)$ then we have

$$
\begin{aligned}
f &= \lim_{\rho} \limsup_{n \to \infty} \rho f_n \\
&\ge \lim_{\rho} \limsup_{n \to \infty} f_n \\
&= \limsup_{n \to \infty} f_n
\end{aligned}
$$

and, therefore, for all $x \in \mathcal{X}$,

$$f(x) \ge \limsup_{n \to \infty} f_n(x),$$

or, written in another way,

$$i_x(f) \ge \limsup_{n \to \infty} i_x(f_n),$$

where $i_x$ is point evaluation. Therefore it follows from the alternative characterization of
upper semicontinuity [4, Lem. 2.42] that point evaluation is upper semicontinuous in the
pull-back topology $\tau_W(\mathrm{UC})$ and, therefore, measurable with respect to the corresponding
Borel $\sigma$-algebra.
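The last step, from upper semicontinuity to Borel measurability, uses only that strict sublevel sets of an upper semicontinuous function are open. A short sketch, assuming real-valued functions, with the threshold $a$ and the Borel set $E$ introduced here for illustration:

```latex
% Upper semicontinuity of i_x on (UC(\mathcal{X}), \tau_W(UC)) means that,
% for every a \in \mathbb{R},
i_x^{-1}\bigl((-\infty, a)\bigr)
  = \{ f \in \mathrm{UC}(\mathcal{X}) \mid f(x) < a \}
  \ \text{is open, hence Borel.}

% The intervals (-\infty, a), a \in \mathbb{R}, generate the Borel
% \sigma-algebra of \mathbb{R}, so
i_x^{-1}(E) \in \mathcal{B}\bigl(\tau_W(\mathrm{UC})\bigr)
  \quad \text{for every Borel } E \subseteq \mathbb{R}.
```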

9 Appendix 

The following lemma is Lemma III.39 p. 86 of [34]. We also refer to p. 87 of [34] for
the existence of the measurable selection $\eta$ (which is also derived from Theorem III.38
p. 85 of [34]). These results are related to Aumann's measurable section principle [ ] (the
extension to Suslin space is due to Sainte-Beuve [ ]).

Lemma 9.1. Let $(T, \mathcal{T})$ be a measurable space, $S$ a Suslin space, $\phi \colon T \times S \to \mathbb{R}$ a
$\mathcal{T} \otimes \mathcal{B}(S)$ measurable function and $\Gamma$ a multifunction (i.e. a set-valued map) from $T$ to
non-empty subsets of $S$ whose graph $G$ belongs to $\mathcal{T} \otimes \mathcal{B}(S)$. Then

1. the function

$$m(t) := \sup\{\phi(t, x) \mid x \in \Gamma(t)\}$$

is a $\hat{\mathcal{T}}$ measurable function of $t$.

2. for $\delta > 0$, there exists $\eta$, a $\hat{\mathcal{T}}$ measurable function of $t$, such that $\eta(t) \in \Gamma(t)$ and
$\phi(t, \eta(t)) \ge m(t) - \delta$.

The following definition is Definition 4.50 in [4]: 

Definition 9.2. Let $(S, \mathcal{S})$ be a measurable space, and let $X$ and $Y$ be topological
spaces. A function $h \colon S \times X \to Y$ is a Carathéodory function if:

1. for each $x \in X$, the function $h^x = h(\cdot, x) \colon S \to Y$ is $(\mathcal{S}, \mathcal{B}(Y))$-measurable; and

2. for each $s \in S$, the function $h_s = h(s, \cdot) \colon X \to Y$ is continuous.

The following lemma is Lemma 4.51 in [4] (see also [34, p. 70]):

Lemma 9.3. Let $(S, \mathcal{S})$ be a measurable space, $X$ a separable metrizable space, and $Y$ a
metrizable space. Then every Carathéodory function $h \colon S \times X \to Y$ is jointly measurable.
9.1 Universally measurable functions

For a topological space $T$ let $\hat{\mathcal{B}}(T)$ denote the $\sigma$-algebra of universally measurable sets.
For a measure $\mu$, let $\hat{\mu}$ denote its completion. Here we state the following proposition
that allows us to define the expected value of $\hat{\mathcal{B}}(T)$ measurable functions with respect
to Borel measures. In all statements in the following proposition, the assertions follow
when the integrals involved exist, in particular for semibounded functions. The proof is
straightforward but tedious and follows from e.g. [45, Thm. pg. 37], the English version
of [41, Ch. 2, pg. 49], and [ ].

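Proposition 9.4. Let $T$ be a topological space. Then we have

• For a measurable function $f$ we have $\mathbb{E}_{\mu} f = \mathbb{E}_{\hat{\mu}} f$.

• Let $f$ be $\hat{\mathcal{B}}(T)$-measurable. Then there exist two measurable functions $\underline{f}$ and $\overline{f}$ such
that

$$\underline{f} \le f \le \overline{f}, \qquad \mu(\underline{f} \ne \overline{f}) = 0$$

and, for any such functions, we have

$$\mathbb{E}_{\mu}[\underline{f}] = \mathbb{E}_{\hat{\mu}}[f] = \mathbb{E}_{\mu}[\overline{f}].$$

• For a fixed $\mu$, $f \mapsto \mathbb{E}_{\hat{\mu}}[f]$ defines an affine function on the cone of non-negative
$\hat{\mathcal{B}}(T)$-measurable functions.

• For a fixed $\hat{\mathcal{B}}(T)$-measurable function $f$, the function $\mathcal{M}(T) \ni \mu \mapsto \mathbb{E}_{\hat{\mu}}[f]$ is affine.

• Suppose that $f_1, f_2$ are $\hat{\mathcal{B}}(T)$-measurable non-negative functions such that $f_1 \le f_2$.
Then $\mathbb{E}_{\hat{\mu}}[f_1] \le \mathbb{E}_{\hat{\mu}}[f_2]$ for all $\mu \in \mathcal{M}(T)$.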

Proposition 9.4 leads to the following definition for the expectation of $\hat{\mathcal{B}}(T)$-measurable
functions with respect to Borel probability measures on $T$:

Definition 9.5. For a Borel probability measure $\mu \in \mathcal{M}(T)$, we define the integral of a
$\hat{\mathcal{B}}(T)$-measurable function $f$ by

$$\mathbb{E}_{\mu}[f] := \mathbb{E}_{\hat{\mu}}[f]$$

when the latter exists, where $\hat{\mu}$ is the completion of the measure $\mu$ as described in [45,
p. 37].

Recall that a carrier $T$ for a probability measure $Q \in \mathcal{M}(\mathcal{Q})$ is a set $T \in \mathcal{B}(\mathcal{Q})$ such
that $Q(T) = 1$. For a carrier $T$, since $T \in \mathcal{B}(\mathcal{Q})$, it follows that $\mathcal{B}(T) = \mathcal{B}(\mathcal{Q}) \cap T$ and
we can define the trace measure $Q_T \in \mathcal{M}(T)$ by $Q_T(A) := Q(A)$, $A \in \mathcal{B}(\mathcal{Q}) \cap T$. The
following proposition shows that the expectation of a function can be defined with respect
to measures which possess carriers upon which the function is universally measurable:
Proposition 9.6. Let $S$ be a topological space. Suppose that $f$ is $\hat{\mathcal{B}}(T)$-measurable for
all measurable $T \subset S$. Suppose also that $Q \in \mathcal{M}(S)$ has a carrier $T \subset S$. Then, using
Definition 9.5, any such carrier $T$ defines an expectation

$$\mathbb{E}_{Q}[f] := \mathbb{E}_{Q_T}[f]$$

and this definition is independent of the carrier; that is, if $T' \subset S$ is another carrier,
then

$$\mathbb{E}_{Q_T}[f] = \mathbb{E}_{Q_{T'}}[f].$$

Moreover, this expectation satisfies the assertions of affinity and monotonicity of Proposition 9.4.

We also need a change of variables formula for expectations of universally measurable 
functions. 

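Proposition 9.7. Let $X$ and $Y$ be topological spaces, $\Psi \colon X \to Y$ a measurable map and
suppose that $f \colon Y \to \mathbb{R}$ is $\hat{\mathcal{B}}(Y)$ measurable. Then $f \circ \Psi \colon X \to \mathbb{R}$ is $\hat{\mathcal{B}}(X)$-measurable
and, for $\pi \in \mathcal{M}(X)$,

$$\mathbb{E}_{\pi}[f \circ \Psi] = \mathbb{E}_{\Psi_* \pi}[f].$$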

For a Suslin space $\mathcal{X}$ and a subset $M \subset \mathcal{M}(\mathcal{X})$ let $\Sigma(M)$ denote the smallest
$\sigma$-subalgebra of subsets of $M$ for which the evaluation map $\nu \mapsto \nu(B)$ is measurable
for all $B \in \mathcal{B}(\mathcal{X})$. The following version of a result of Weizsäcker and Winkler [111] as
stated in [121, Thm. 3.1] will be useful to us:

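Theorem 9.8. Consider a Suslin space $\mathcal{X}$, $n$ real valued measurable functions $f_i \colon \mathcal{X} \to
\mathbb{R}$, $n$ constants $c_1, \ldots, c_n \in \mathbb{R}$, and define

$$H := \{\nu \in \mathcal{M}(\mathcal{X}) \mid f_i \text{ is } \nu\text{-integrable and } \mathbb{E}_{\nu}[f_i] \le c_i, \text{ for } i = 1, \ldots, n\}.$$

Then, for each $\nu \in H$, there is a probability measure $p$ on $\Sigma(\operatorname{ext}(H))$ such that

$$\nu(B) = \int_{\operatorname{ext}(H)} \nu'(B) \, dp(\nu'), \quad \text{for all } B \in \mathcal{B}(\mathcal{X}). \qquad (9.1)$$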
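For intuition, when the representing measure $p$ in (9.1) happens to be finitely supported, the barycentric formula reduces to an ordinary convex combination. The weights $p_k$ and the measures $\nu_k$ below are hypothetical and serve only to illustrate this special case; the theorem itself makes no such finiteness assumption.

```latex
% Special case: p = \sum_{k=1}^{K} p_k \, \delta_{\nu_k} with
% \nu_k \in \operatorname{ext}(H), \ p_k \ge 0, \ \sum_{k=1}^{K} p_k = 1.
\nu(B) = \int_{\operatorname{ext}(H)} \nu'(B) \, dp(\nu')
       = \sum_{k=1}^{K} p_k \, \nu_k(B),
\qquad \text{for all } B \in \mathcal{B}(\mathcal{X}).
```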

[121, Prop. 3.1] shows that if a measurable function $f \colon \mathcal{X} \to \mathbb{R}$ is integrable with
respect to all measures in $H$ (allowing the values $\infty$ and $-\infty$), then integration

$$F(\nu) := \int_{\mathcal{X}} f \, d\nu$$

is measure affine per Definition 4.3. We need a slightly more general result:

Lemma 9.9. Consider the situation of Theorem 9.8, and let $f \colon \mathcal{X} \to \mathbb{R}$ be a semibounded
universally measurable function. Then

$$F(\nu) := \mathbb{E}_{\nu}[f], \quad \text{for } \nu \in H,$$

is measure affine per Definition 4.3.
The next lemma extends [121, Thm. 2.1] to the case where the constraint functions
$f_i$, for $i = 1, \ldots, n$, are universally measurable:

Lemma 9.10. Let $\mathcal{X}$ be Suslin, and fix universally measurable real-valued functions
$f_1, \ldots, f_n$ and constants $c_1, \ldots, c_n$. Then

$$H := \{\nu \in \mathcal{M}(\mathcal{X}) \mid f_i \text{ is } \hat{\nu}\text{-integrable and } \mathbb{E}_{\nu}[f_i] \le c_i \text{ for } i = 1, \ldots, n\} \qquad (9.2)$$

is convex and

$$
\begin{aligned}
\operatorname{ext}(H) = \Bigl\{ \nu \in H \Bigm|\ & \nu = \sum_{i=1}^{m} \alpha_i \delta_{x_i},\ \alpha_i > 0,\ x_i \in \mathcal{X},\ i = 1, \ldots, m,\ \sum_{i=1}^{m} \alpha_i = 1,\ 1 \le m \le n + 1, \\
& \text{the vectors } (f_1(x_i), f_2(x_i), \ldots, f_n(x_i), 1),\ 1 \le i \le m, \text{ are linearly independent} \Bigr\}.
\end{aligned}
$$

9.2 Proofs 

9.2.1 Proof of Proposition 9.6 

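Let $T$ and $T'$ be two carriers for $Q \in \mathcal{M}(S)$ and $f$ a function such that $f_T$ and $f_{T'}$ are
$\hat{\mathcal{B}}(T)$ and $\hat{\mathcal{B}}(T')$ measurable respectively. Then Proposition 9.4 implies that there are
functions $f_1, f_2$ measurable on $T$ and $f_1', f_2'$ measurable on $T'$ such that

$$f_1 \le f_T \le f_2, \qquad Q_T(f_1 \ne f_2) = 0,$$

$$f_1' \le f_{T'} \le f_2', \qquad Q_{T'}(f_1' \ne f_2') = 0,$$

so that

$$\mathbb{E}_{Q_T}[f] = \mathbb{E}_{Q_T}[f_1], \qquad \mathbb{E}_{Q_{T'}}[f] = \mathbb{E}_{Q_{T'}}[f_1'].$$

Now, it is easy to see that $T \cap T'$ is also a carrier and that we have

$$f_1(x) \le f(x) \le f_2(x), \quad x \in T \cap T',$$

and

$$Q_{T \cap T'}(f_1 \ne f_2) \le Q_T(f_1 \ne f_2) = 0$$

so that we conclude from Proposition 9.4 that

$$
\begin{aligned}
\mathbb{E}_{Q_{T \cap T'}}[f] &= \mathbb{E}_{Q_{T \cap T'}}[f_1] \\
&= \mathbb{E}_{Q_T}[f_1] - \mathbb{E}_{Q_{T \setminus T'}}[f_1] \\
&= \mathbb{E}_{Q_T}[f_1] \\
&= \mathbb{E}_{Q_T}[f]
\end{aligned}
$$

and so conclude that

$$\mathbb{E}_{Q_{T \cap T'}}[f] = \mathbb{E}_{Q_T}[f].$$

By the same argument on $T'$ we conclude that $\mathbb{E}_{Q_{T \cap T'}}[f] = \mathbb{E}_{Q_{T'}}[f]$ and therefore the first
assertion is proved. The assertions of affinity and monotonicity are similarly straightforward.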
9.2.2 Proof of Proposition 9.7

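Consider $\pi \in \mathcal{M}(X)$ and its pushforward $\nu := \Psi_* \pi$. By Proposition 9.4 and the assumptions,
there exist two measurable functions $\underline{f}$ and $\overline{f}$ such that

$$\underline{f} \le f \le \overline{f}, \qquad \nu(\underline{f} \ne \overline{f}) = 0,$$

from which we conclude that

$$\underline{f} \circ \Psi \le f \circ \Psi \le \overline{f} \circ \Psi$$

and

$$\{\underline{f} \circ \Psi \ne \overline{f} \circ \Psi\} = \Psi^{-1}\{\underline{f} \ne \overline{f}\},$$

so that we obtain

$$\pi[\underline{f} \circ \Psi \ne \overline{f} \circ \Psi] = \pi[\Psi^{-1}\{\underline{f} \ne \overline{f}\}] = \nu[\underline{f} \ne \overline{f}] = 0.$$

Since $\pi$ was arbitrary, it follows that $f \circ \Psi$ is $\hat{\mathcal{B}}(X)$-measurable. To obtain the change
of variables formula, compute

$$\mathbb{E}_{\pi}[f \circ \Psi] = \mathbb{E}_{\pi}[\overline{f} \circ \Psi] = \mathbb{E}_{\Psi_* \pi}[\overline{f}]$$

and

$$\mathbb{E}_{\Psi_* \pi}[f] = \mathbb{E}_{\Psi_* \pi}[\overline{f}],$$

from which we conclude the change of variables formula.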

9.2.3 Proof of Lemma 9.9 

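Fix $\nu \in H$ and a probability measure $p$ such that the barycentric formula (9.1) holds.
Proposition 9.4 asserts that there are measurable functions $f_1 \le f \le f_2$ such that
$\nu(f_1 \ne f_2) = 0$. Therefore, $f_2 - f \ge 0$, $\mathbb{E}_{\nu}(f_2 - f) = 0$, $f - f_1 \ge 0$, and $\mathbb{E}_{\nu}(f - f_1) = 0$.
Moreover, it is easy to see that we can then make both $f_1$ and $f_2$ semibounded. Therefore $F$
is a well defined extended real valued function. Moreover, [121, Prop. 3.1] asserts that
the function $\nu \mapsto \mathbb{E}_{\nu}[f_i]$ is measure affine for $i = 1, 2$, and so

$$\mathbb{E}_{\nu}[f_i] = \int_{\operatorname{ext}(H)} \mathbb{E}_{\nu'}[f_i] \, dp(\nu'), \quad \text{for } i = 1, 2.$$

Consequently, since $\nu[f_1 \ne f_2] = 0$, it follows that $\mathbb{E}_{\nu}[f_2 - f_1] = 0$ so that

$$0 = \mathbb{E}_{\nu}[f_2 - f_1] = \int_{\operatorname{ext}(H)} \mathbb{E}_{\nu'}[f_2 - f_1] \, dp(\nu'),$$

and since $f_2 - f_1 \ge 0$ it follows that

$$\nu'[f_2 \ne f_1] = 0, \quad p\text{-a.e.},$$

and therefore

$$\hat{\nu}'[f \ne f_1] = 0, \quad p\text{-a.e.}$$

Therefore we conclude that

$$
\begin{aligned}
F(\nu) = \mathbb{E}_{\nu}[f_1] &= \int_{\operatorname{ext}(H)} \mathbb{E}_{\nu'}[f_1] \, dp(\nu') \\
&= \int_{\operatorname{ext}(H)} \mathbb{E}_{\hat{\nu}'}[f_1] \, dp(\nu') \\
&= \int_{\operatorname{ext}(H)} \mathbb{E}_{\hat{\nu}'}[f_1] \, dp(\nu') + \int_{\operatorname{ext}(H)} \mathbb{E}_{\hat{\nu}'}[f - f_1] \, dp(\nu') \\
&= \int_{\operatorname{ext}(H)} \mathbb{E}_{\hat{\nu}'}[f] \, dp(\nu') \\
&= \int_{\operatorname{ext}(H)} F(\nu') \, dp(\nu')
\end{aligned}
$$

and the assertion is proved.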

9.2.4 Proof of Lemma 9.10 

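Let us first establish that

$$\widehat{\nu_1 + \nu_2} = \hat{\nu}_1 + \hat{\nu}_2, \quad \text{for all } \nu_1, \nu_2 \in \mathcal{M}(\mathcal{X}), \qquad (9.3)$$

$$\widehat{\alpha \nu} = \alpha \hat{\nu}, \quad \text{for all } \nu \in \mathcal{M}(\mathcal{X}). \qquad (9.4)$$

This follows from the fact that $(\nu_1 + \nu_2)(N) = 0$ if and only if $\nu_j(N) = 0$ for $j = 1, 2$
and the characterization of the completion $\hat{\nu}$ by

$$\hat{\nu}(B \cup S) := \nu(B), \quad B \in \mathcal{B}(\mathcal{X}),\ S \subset N,\ \nu(N) = 0,$$

as found, for example, in [ , p. 18]. For then, for such $B$ and $S$,

$$
\begin{aligned}
\widehat{\nu_1 + \nu_2}(B \cup S) &= (\nu_1 + \nu_2)(B) \\
&= \nu_1(B) + \nu_2(B) \\
&= \hat{\nu}_1(B \cup S) + \hat{\nu}_2(B \cup S).
\end{aligned}
$$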

Now for the proof of the main assertion. Following the proof of [121, Thm. 2.1], it is
sufficient to show that for

$$K := \{\nu \in \mathcal{M}(\mathcal{X}) \mid f_i \text{ is } \hat{\nu}\text{-integrable for } i = 1, \ldots, n\},$$

we have

$$\operatorname{ext}(K) = \{\delta_x,\ x \in \mathcal{X}\}, \qquad (9.5)$$

and that $\mathbb{R}_+ K \subset \mathbb{R}_+ \mathcal{M}(\mathcal{X})$ is a lattice cone in its own ordering. For the first, observe
that since $\operatorname{ext}(\mathcal{M}(\mathcal{X})) = \{\delta_x \mid x \in \mathcal{X}\}$ and the $f_i$ are $\delta_x$-integrable for all $i = 1, \ldots, n$,
$x \in \mathcal{X}$, it follows that

$$\{\delta_x \mid x \in \mathcal{X}\} \subset \operatorname{ext}(K).$$

Now suppose that $\nu \in \operatorname{ext}(K)$ is not a Dirac mass. Then, as in the proof that the extreme
points of $\mathcal{M}(\mathcal{X})$ are the Dirac masses, see e.g. [4, Thm. 15.9], and using the fact that
the support of $\nu$ must contain 2 or more points, we can decompose $\nu = \alpha \nu_1 + (1 - \alpha) \nu_2$
where $\nu_1 \ne \nu_2$ and $\alpha \in (0, 1)$. Moreover, from

$$\hat{\nu} = \alpha \hat{\nu}_1 + (1 - \alpha) \hat{\nu}_2$$

we conclude that $f_i$ being $\hat{\nu}$-integrable implies that $f_i$ is $\hat{\nu}_j$-integrable for $j = 1, 2$ and
$i = 1, \ldots, n$. Consequently, $\nu_j \in K$ for $j = 1, 2$. Since $\nu$ was an extreme point we
conclude that $\nu_1 = \nu_2$, which is a contradiction, and (9.5) follows.

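Now let us demonstrate that $\mathbb{R}_+ K$ is a lattice cone in its own ordering. To that end,
note that by [89, Lem. 10.4], it suffices to show that $\mathbb{R}_+ K \subset \mathbb{R}_+ \mathcal{M}(\mathcal{X})$ is a hereditary
subcone, in that $\nu_1 \in \mathbb{R}_+ K$, $\nu_2 \in \mathbb{R}_+ \mathcal{M}(\mathcal{X})$ and $\nu_1 - \nu_2 \in \mathbb{R}_+ \mathcal{M}(\mathcal{X})$ together imply that
$\nu_2 \in \mathbb{R}_+ K$. To that end, consider such $\nu_1$ and $\nu_2$. Then (9.3) implies that $\widehat{(\nu_1 - \nu_2)} = \hat{\nu}_1 - \hat{\nu}_2$
and so we conclude that

$$0 \le \mathbb{E}_{\widehat{\nu_1 - \nu_2}}[|f_i|] = \mathbb{E}_{\hat{\nu}_1}[|f_i|] - \mathbb{E}_{\hat{\nu}_2}[|f_i|]$$

and therefore

$$\mathbb{E}_{\hat{\nu}_2}[|f_i|] \le \mathbb{E}_{\hat{\nu}_1}[|f_i|] < \infty,$$

from which we conclude that $\nu_2 \in \mathbb{R}_+ K$. Hence, $\mathbb{R}_+ K$ is a hereditary subcone, and the
assertion then follows as in the proof of [121, Thm. 2.1].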